UNCOVERING THE RIFFLED INDEPENDENCE
STRUCTURE OF RANKINGS

By Jonathan Huang and Carlos Guestrin 
Carnegie Mellon University 

Representing distributions over permutations can be a daunting 
task due to the fact that the number of permutations of n objects 
scales factorially in n. One recent way that has been used to re- 
duce storage complexity has been to exploit probabilistic indepen- 
dence, but as we argue, full independence assumptions impose strong 
sparsity constraints on distributions and are unsuitable for modeling
rankings. We identify a novel class of independence structures,
called riffled independence, encompassing a more expressive family
of distributions while retaining many of the properties necessary for
performing efficient inference and reducing sample complexity. In
riffled independence, one draws two permutations independently, then
performs the riffle shuffle, common in card games, to combine the
two permutations to form a single permutation. Within the context
of ranking, riffled independence corresponds to ranking disjoint
sets of objects independently, then interleaving those rankings. In
this paper, we provide a formal introduction to riffled independence
and present algorithms for using riffled independence within
Fourier-theoretic frameworks which have been explored by a number of recent
papers. Additionally, we propose an automated method for discovering
sets of items which are riffle independent from a training set of
rankings. We show that our clustering-like algorithms can be used
to discover meaningful latent coalitions from real preference ranking
datasets and to learn the structure of hierarchically decomposable
models based on riffled independence.

1. Introduction. Ranked data appears ubiquitously in various statistics
and machine learning application domains. Rankings are useful, for
example, in reasoning about preference lists in surveys [Kamishima, 2003],
search results in information retrieval applications [M. Sun, 2010], ballots
in certain elections [Diaconis, 1989], and even the ordering of topics and
paragraphs within a document [Chen et al., 2009]. The problem of building
statistical models on rankings has thus been an important research topic in the
learning community. As with many challenging learning problems, one must
contend with an intractably large state space when dealing with rankings
since there are n! ways to rank n objects. In building a statistical model over
rankings, simple (yet flexible) models are therefore preferable because they
are typically more computationally tractable and less prone to overfitting.

A popular and highly successful approach for achieving such simplicity 
for distributions involving large collections of interdependent variables has 
been to exploit conditional independence structures (e.g., naive Bayes, tree, 
Markov models). With ranking problems, however, independence-based re- 
lations are typically trickier to exploit due to the so-called mutual exclusivity 
constraints which constrain any two items to map to different ranks in any 
given ranking. 

In this paper, we present a novel, relaxed notion of independence, called
riffled independence, in which one ranks disjoint subsets of items
independently, then interleaves the subset rankings to form a joint ranking of the
item set. For example, if one ranks a set of food items containing fruits
and vegetables by preference, then one might first rank the vegetable and
fruit sets separately, then interleave the two rankings to form a ranking for
the full item set. Riffled independence appears naturally in many ranked
datasets. As we show, political coalitions in elections which use the STV
(single transferable vote) voting mechanism typically lead to pronounced
riffled independence constraints in the vote histograms.

Chaining the interleaving operations recursively leads to a simple, inter- 
pretable class of models over rankings, not unlike graphical models. We 
present methods for learning the parameters of such models and for esti- 
mating their structure. 

The following is an outline of our main contributions as well as a roadmap
for the sections ahead.^1

• Section 2 gives a broad overview of several approaches for modeling 
probability distributions over permutations. In particular, we summa- 
rize the results of Huang et al. [2009], which studied probabilistic in- 
dependence relations in distributions on permutations. 

• In Section 3, we introduce our main contribution: an intuitive, novel
generalization of the notion of independence for permutations, riffled
independence, based on interleaving independent rankings of subsets of
items. We show riffled independence to be a more appropriate notion
of independence for ranked data and exhibit evidence that riffle
independence relations can approximately hold in real ranked datasets.
We also discuss ideas for exploiting riffled independence relations in
a distribution to reduce sample complexity and to perform efficient
inference.

^1 This paper is an extended presentation of our previous papers [Huang and Guestrin,
2009a], which was the first introduction of riffled independence, and [Huang and Guestrin,
2010], which studied hierarchical models based on riffle independent decompositions.

• Within the same section, we introduce a novel family of distributions
over the set of interleavings of two item sets, called biased riffle shuffles,
that are useful in the context of riffled independence. We propose an
efficient recursive procedure for computing the Fourier transform of
these biased riffle shuffle distributions.

• In Section 5, we discuss the problem of estimating model parameters of
a riffle independent model from ranking data, and computing various
statistics from model parameters. To perform such computations in a
scalable way, we develop algorithms that can be used in the
Fourier-theoretic framework of Kondor, Howard and Jebara [2007], Huang et al.
[2009], and Huang, Guestrin and Guibas [2009b] for joining riffle
independent factors (RiffleJoin), and for teasing apart the riffle
independent factors from a joint (RiffleSplit), and provide theoretical and
empirical evidence that our algorithms perform well.

• We use Section 6 to define a family of simple and interpretable, yet 
flexible distributions over rankings, called hierarchical riffle indepen- 
dent models, in which subsets of items are iteratively interleaved into 
larger and larger subsets in a recursive stagewise fashion. 

• Sections 7, 8, and 9 tackle the problem of structure learning for our 
riffle independent models. In Section 7, we propose a method for finding 
the partitioning of the item set such that the subsets of the partition 
are as close to riffle independent as possible. In particular, we propose 
a novel objective for quantifying the degree to which two subsets are 
riffle independent of each other. In Sections 8 and 9, we apply our
partitioning algorithm to perform model selection from training data 
in polynomial time, without having to exhaustively search over the 
exponentially large space of hierarchical structures. 

• Finally in Section 11, we apply our algorithms to a number of datasets 
both simulated and real in order to validate our methods and assump- 
tions. We show that our methods are indeed effective, and apply them 
in particular to various voting and preference ranking datasets. 

2. Distributions on rankings. In this paper, we will be concerned
with distributions over rankings. A ranking $\sigma = (\sigma(1), \dots, \sigma(n))$ is a
one-to-one association between n items and ranks, where $\sigma(j) = i$ means that
the j-th item is assigned rank i under $\sigma$. By convention, we will think of low
ranked items as being preferred over higher ranked items (thus, ranking an
item in first place means that it is the most preferred out of all items). We
will also refer to a ranking $\sigma$ by its inverse, $[[\sigma^{-1}(1), \dots, \sigma^{-1}(n)]]$ (called an
ordering and denoted with double brackets instead of parentheses), where
$\sigma^{-1}(i) = j$ also means that the j-th item is assigned rank i under $\sigma$. We
use both notations because certain concepts are more intuitive to express in
one notation than in the other.

Example 1. As a running example in this paper, we will consider ranking
a small list of 6 items consisting of fruits and vegetables, numbered
as follows: (1) Corn (C), (2) Peas (P), (3) Lemon (L), (4) Orange (O),
(5) Fig (F), and (6) Grapes (G).

The ranking $\sigma = (3, 1, 5, 6, 2, 4)$ means, for example, that Corn is ranked
third, Peas is ranked first, Lemon is ranked fifth, and so on. In ordering
notation, the same ranking is expressed as $\sigma = [[P, F, C, G, L, O]]$. Finally,
we will use $\sigma(3) = \sigma(L) = 5$ to denote the rank of the third item, Lemon.
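The relationship between the ranking and ordering notations can be sketched in a few lines of Python. This is only an illustration: the helper name `ranking_to_ordering` and the `ITEMS` list are ours, not part of the paper, and items and ranks are 1-indexed to match the text.

```python
# Item labels of the running example, in item-index order 1..6.
ITEMS = ["Corn", "Peas", "Lemon", "Orange", "Fig", "Grapes"]

def ranking_to_ordering(sigma):
    """Invert a ranking. sigma[j-1] = i means item j holds rank i.

    The result lists, for each rank 1..n, the item occupying that rank.
    """
    ordering = [0] * len(sigma)
    for item, rank in enumerate(sigma, start=1):
        ordering[rank - 1] = item
    return tuple(ordering)

sigma = (3, 1, 5, 6, 2, 4)                 # ranking notation
ordering = ranking_to_ordering(sigma)      # ordering notation (item indices)
names = [ITEMS[item - 1] for item in ordering]
print(names)
```

Because an ordering is just the inverse permutation, applying the same helper twice recovers the original ranking.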

Permutations and the symmetric group. Rankings are similar to permutations,
which are 1-1 mappings from the set {1, ..., n} into itself, the subtle
difference being that rankings map between two different sets of size n. In
this paper, we will use the same notation for permutations and rankings,
but use permutations to refer to (1-1) functions which rearrange the ordering
of the item set or the ranks. If $\tau$ is a permutation of the set of ranks,
then given a ranking $\sigma$, one can rearrange the ranks by left-composing
with $\tau$. Thus, the ranking $\tau\sigma$ maps item i to rank $\tau(\sigma(i))$. On the other
hand, if $\tau$ is a permutation of the item set, one can rearrange the item set
by right-composing with $\tau^{-1}$. Thus, if item j was relabeled as item $i = \tau(j)$,
then $\sigma(\tau^{-1}(i))$ returns the rank of item j with respect to the original item
ordering. Finally, we note that the composition of any two permutations is
itself a permutation, and the collection of all n! permutations forms a group,
commonly known as the symmetric group, or $S_n$.^2
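As a minimal sketch of the two kinds of composition, assume permutations are stored as 1-indexed tuples. The helpers `compose` and `inverse` and the two swap permutations below are hypothetical illustrations, not taken from the paper.

```python
def compose(tau, sigma):
    """Return the composition tau sigma, i.e. i -> tau(sigma(i)), 1-indexed."""
    return tuple(tau[s - 1] for s in sigma)

def inverse(tau):
    """Return the inverse permutation of tau."""
    inv = [0] * len(tau)
    for i, image in enumerate(tau, start=1):
        inv[image - 1] = i
    return tuple(inv)

sigma = (3, 1, 5, 6, 2, 4)        # ranking from the running example

# Left-composition rearranges ranks: here ranks 1 and 2 are exchanged.
swap_ranks = (2, 1, 3, 4, 5, 6)
rearranged_ranks = compose(swap_ranks, sigma)

# Right-composition with the inverse relabels items: items 1 and 2 trade labels.
swap_items = (2, 1, 3, 4, 5, 6)
relabeled_items = compose(sigma, inverse(swap_items))

print(rearranged_ranks, relabeled_items)
```

In the first case the items previously in first and second place trade ranks; in the second, the ranks stay attached to the same physical items while their labels change.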

A distribution $h(\sigma)$, defined over the set of rankings or permutations, can
be viewed as a joint distribution over the n variables $(\sigma(1), \dots, \sigma(n))$ (where
$\sigma(j) \in \{1, \dots, n\}$), subject to mutual exclusivity constraints which stipulate
that two objects cannot simultaneously map to the same rank, or alternatively,
that two ranks cannot simultaneously be occupied by the same object
($h(\sigma(i) = \sigma(j)) = 0$ whenever $i \neq j$).

^2 We will sometimes abusively denote the set of rankings by $S_n$. Strictly speaking,
however, the rankings are not a group, and instead one says that $S_n$ acts faithfully on
rankings.

Items of the running example (Example 1):

  1. Corn (C)      2. Peas (P)     3. Lemon (L)
  4. Orange (O)    5. Fig (F)      6. Grapes (G)

Fig 1. APA (American Psychological Association) election data. (a) Vote distribution:
percentage of votes for each of 5! = 120 possible rankings; the mode of the distribution is
$\sigma = (2, 3, 1, 5, 4)$. (b) Matrix of first order marginals: the (i, j)-th entry reflects the number
of voters who ranked candidate j in the i-th rank.

Example 2 (APA election data). As a running example throughout the
paper (in addition to the fruits and vegetables), we will analyze the well known
APA election dataset that was first used by Diaconis [1988] and has since been
analyzed in a number of ranking studies. The APA dataset is a collection of
5738 ballots from a 1980 presidential election of the American Psychological
Association where members rank ordered five candidates from favorite to
least favorite. The names of the five candidates that year were (1) William
Bevan, (2) Ira Iscoe, (3) Charles Kiesler, (4) Max Siegle, and (5) Logan
Wright [Marden, 1995].

Since there are five candidates, there are 5! = 120 possible rankings. In
Figure 1(a) we plot the proportion of votes that each ranking received.
Interestingly, instead of concentrating at just a small set of rankings, the vote
distribution in the APA dataset is fairly diffuse with every ranking receiving
some number of votes. The mode of the vote distribution occurs at the ranking
$\sigma = (2, 3, 1, 5, 4)$ = [[C. Kiesler, W. Bevan, I. Iscoe, L. Wright, M. Siegle]]
with 186 votes.

For interpretability, we also visualize the matrix of first-order marginals
in which the (i, j) entry represents the number of voters who assigned rank i
to candidate j. Figure 1(b) represents the first-order matrix using grayscale
levels to represent numbers of voters. What can be seen is that overall,
candidate 3 (C. Kiesler) received the highest number of votes for rank 1 (and
incidentally, won the election). The vote distribution gives us a story that
goes far deeper than simply telling us who the winner was, however.
Diaconis [1988], for example, noticed that candidate 3 also had a significant "hate"
vote: a good number of voters placed him in the last rank. Throughout this
paper, we will let this story unfold via a series of examples based on the APA
dataset.






J. HUANG ET AL. 



2.1. Dealing with factorial possibilities. The fact that there are factorially
many possible rankings poses a number of significant challenges for learning
and inference. First, there is no way to tractably represent arbitrary
distributions over rankings for large n. Storing an array of 12! doubles, for example,
requires roughly 3.8 gigabytes of storage (12! is about $4.8 \times 10^8$ entries at
8 bytes each), which is beyond the RAM capacity
of a typical modern PC. Second, the naive algorithmic complexity of common
probabilistic operations is also intractable for such distributions. Computing
the marginal probability, $h(\sigma(i) < \sigma(j))$, that item i is preferred to item j,
for example, requires a summation over $O((n-2)!)$ elements. Finally, even
if storage and computation issues were resolved, one would still have sample
complexity issues to contend with: for nontrivial n, it is impractical
to hope that each of the n! possible rankings would appear even once in a
training set of rankings. The only existing datasets in which every possible
ranking is realized are those for which $n \leq 5$, and in fact, the APA dataset
(Example 2) is the only such dataset for n = 5 that we are aware of.

The quest for exploitable problem structure has led researchers in machine
learning and related fields to consider a number of possibilities including
distribution sparsity [Farias, Jagabathula and Shah, 2009; Jagabathula and
Shah, 2008; Reid, 1979], exponential family parameterizations [Helmbold
and Warmuth, 2007; Lebanon and Mao, 2008; Meila et al., 2007; Petterson
et al., 2009], algebraic/Fourier structure [Huang, Guestrin and Guibas, 2007,
2009b; Kondor and Borgwardt, 2008; Kondor, Howard and Jebara, 2007], and
probabilistic independence [Huang et al., 2009]. We briefly summarize several
of these approaches in the following.

Parametric models. We will not be able to do justice to the sheer volume of
previous work on parametric ranking models. Parametric probabilistic models
over the space of rankings have a rich tradition in statistics [Fligner
and Verducci, 1986, 1988; Guiver and Snelson, 2009; Mallows, 1957;
Marden, 1995; Meila et al., 2007; Plackett, 1975; Thurstone, 1927], and to this
day, researchers continue to expand upon this body of work. For example,
the well known Mallows model (which we will discuss in more detail in
Section 6), which is often thought of as an analogue of the normal distribution
for permutations, parameterizes a distribution with a "mean" permutation
and a precision/spread parameter.

The models proposed in this paper generalize some of the classical models 
from the statistical ranking literature, allowing for more expressive distri- 
butions to be captured. At the same time, our methods form a conceptual 
bridge to popular models (i.e., graphical models) used in machine learning 
which, rather than relying on a prespecified parametric form, simply work 
within a family of distributions that are consistent with some set of
conditional independence assumptions [Koller and Friedman, 2009].

Sparse methods. Sparse methods for summarizing distributions range from
older ad-hoc approaches such as maintaining k-best hypotheses [Reid, 1979]
to the more recent compressed sensing inspired approaches discussed in [Farias,
Jagabathula and Shah, 2009; Jagabathula and Shah, 2008]. Such approaches
assume that there are at most k permutations which own all (or almost all)
of the probability mass, where k scales either sublinearly or as a low degree
polynomial in n. While sparse distributions have been successfully applied
in certain tracking domains, we argue that they are often less suitable in
ranking problems where it might be necessary to model indifference over a
large subset of objects.^3 If one is approximately indifferent among a subset of
k objects, then there are at least k! rankings with nonzero probability mass.
As an example, one can see that the APA vote distribution (Figure 1(a))
is clearly not a sparse distribution, with each ranking having received some
nonzero number of votes.

Fourier-based (low-order) methods. Another recent thread of research has
centered around Fourier-based methods which maintain a set of low-order
summary statistics [Diaconis, 1988; Huang, Guestrin and Guibas, 2009b;
Kondor, 2008; Shin et al., 2005]. The first-order summary, for example, stores
a marginal probability of the form $h(\sigma : \sigma(j) = i)$ for every pair (i, j) and
thus requires storing a matrix of only $O(n^2)$ numbers. In our fruits/vegetables
example, we might store the probability that Figs are ranked first, or the
probability that Peas is ranked last.
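To make the first-order summary concrete, the sketch below computes the matrix of marginals $h(\sigma : \sigma(j) = i)$ by brute force from an explicit table of ranking probabilities. The function name and the toy distribution on three items are illustrative assumptions; the paper itself works with Fourier coefficients rather than explicit tables.

```python
def first_order_matrix(h, n):
    """Return M with M[i-1][j-1] = h(sigma : sigma(j) = i).

    h maps rankings (1-indexed tuples, sigma[j-1] = rank of item j)
    to probabilities.
    """
    M = [[0.0] * n for _ in range(n)]
    for sigma, prob in h.items():
        for item, rank in enumerate(sigma, start=1):
            M[rank - 1][item - 1] += prob
    return M

# Hypothetical distribution over rankings of three items.
h = {(1, 2, 3): 0.5, (2, 1, 3): 0.3, (3, 1, 2): 0.2}
M = first_order_matrix(h, 3)
for row in M:
    print(row)
```

Every row and every column of the resulting matrix sums to one, since each rank is held by exactly one item and each item holds exactly one rank.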

Example 3 (APA election data (continued)). In the following matrix,
we record the first order matrix computed from the histogram of votes in the
APA election example (also visualized using grayscale levels in Figure 1(b)).
Dividing each number by the total number of votes would yield a matrix of
first order marginal probabilities.

          W. Bevan   I. Iscoe   C. Kiesler   M. Siegle   L. Wright
Rank 1       …          775         …          1172        1129
Rank 2      1519         …          …           972        1210
Rank 3      1313       1415         …          1089        1128
Rank 4      1002         …          …          1164        1106
Rank 5       851       1055         …          1341        1165

^3 In some situations, particularly when one is interested primarily in accurately
capturing a loss or payoff function instead of raw ranking probabilities, it can suffice to use
a sparse proxy distribution even if the true underlying distribution is not itself sparse.
See, for example, [Farias, Jagabathula and Shah, 2009; Helmbold and Warmuth, 2007] for
details.
More generally, one might store s-th order marginals, which are marginal
probabilities of s-tuples. The second-order marginals, for example, take the
form $h(\sigma : \sigma(k, \ell) = (i, j))$ (perhaps encoding the joint probability that
Grapes are ranked first, and Peas second) and require $O(n^4)$ storage.

Low-order marginals turn out to be intimately related to a generalized
form of Fourier analysis. Generalized Fourier transforms for functions on
permutations have been studied for several decades now, primarily by Persi
Diaconis and his collaborators [Clausen and Baum, 1993; Diaconis, 1988;
Maslen, 1998; Rockmore, 2000; Terras, 1999]. Low-order marginals
correspond, in a certain sense, to the low-frequency Fourier coefficients of a
distribution over permutations. For example, the first-order matrix of $h(\sigma)$ can be
reconstructed exactly from $O(n^2)$ of the lowest frequency Fourier coefficients
of $h(\sigma)$, and the second-order matrix from $O(n^4)$ of the lowest frequency
Fourier coefficients. From a Fourier theoretic perspective, one sees that low
order marginals are not just a reasonable way of summarizing a distribution,
but can actually be viewed as a principled "low frequency" approximation
thereof. In contrast with sparse methods, Fourier-based methods handle
diffuse distributions well but are not easily scalable without making aggressive
independence assumptions [Huang et al., 2009] since, in general, one requires
$O(n^{2s})$ coefficients to exactly reconstruct s-th order marginals, which quickly
becomes intractable for moderately large n.

2.2. Fully independent subsets of items. To scale to larger problems, Huang
et al. [2009] demonstrated that, by exploiting probabilistic independence, one
could dramatically improve the scalability of Fourier-based methods, e.g.,
for tracking problems, since confusion in data association only occurs over
small independent subgroups of objects in many problems. Probabilistic
independence assumptions on the symmetric group can simply be stated as
follows. Consider a distribution h defined over $S_n$. Let A be a p-subset of
{1, ..., n}, say, {1, ..., p}, and let B be its complement ({p+1, ..., n}) with
size q = n - p. We say that $\sigma(A) = (\sigma(1), \sigma(2), \dots, \sigma(p))$ and
$\sigma(B) = (\sigma(p+1), \dots, \sigma(n))$ are independent if

(2.1)   $h(\sigma) = f(\sigma(1), \sigma(2), \dots, \sigma(p)) \cdot g(\sigma(p+1), \dots, \sigma(n))$.

Storing the parameters for the above distribution requires keeping $O(p! + q!)$
probabilities instead of the much larger $O(n!)$ size required for general
distributions. Of course, $O(p! + q!)$ can still be quite large. Typically, one
decomposes the distribution recursively and stores factors exactly for small
enough factors, or compresses factors using Fourier coefficients (but using
higher frequency terms than what would be possible without the independence
assumption). In order to exploit probabilistic independence in the
Fourier domain, Huang et al. [2009] proposed algorithms for joining factors
and splitting distributions into independent components in the Fourier
domain.

Fig 2. Example first-order matrices with A = {1, 2, 3}, B = {4, 5, 6} fully independent,
where black means $h(\sigma : \sigma(j) = i) = 0$. In each case, there is some 3-subset A' which A is
constrained to map to with probability one. Notice that, with respect to some rearranging
of the rows, independence imposes a block-diagonal structure on first-order matrices.

Fig 3. Approximating the APA vote distribution by a factored distribution in which
candidate 3 is independent of candidates {1, 2, 4, 5}. (a) In thick gray, the true distribution; in
dotted purple, the approximate distribution. Notice that the factored distribution assigns
zero probability to most permutations. (b) Matrix of first order marginals of the
approximating distribution.

Despite its utility for many tracking problems, however, we argue that the
independence assumption on permutations implies a rather restrictive
constraint on distributions, rendering independence highly unrealistic in ranking
applications. In particular, using the mutual exclusivity property, it can be
shown [Huang et al., 2009] that, if $\sigma(A)$ and $\sigma(B)$ are independent, then A
and B are not allowed to map to the same ranks. That is, for some fixed
p-subset $A' \subset \{1, \dots, n\}$, $\sigma(A)$ is a permutation of elements in A' and $\sigma(B)$
is a permutation of its complement, B', with probability 1.

Example 4. Continuing with our vegetable/fruit example with n = 6, suppose
that the vegetable and fruit rankings,

$\sigma_A = [\sigma(Corn), \sigma(Peas)]$ and $\sigma_B = [\sigma(Lemons), \sigma(Oranges), \sigma(Figs), \sigma(Grapes)]$,

are known to be independent. Then, for A' = {1, 2}, the vegetables occupy
the first and second ranks with probability one, and the fruits occupy ranks
B' = {3, 4, 5, 6} with probability one, reflecting that vegetables are always
preferred over fruits according to this distribution.
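The restriction described in this example can be checked numerically. The sketch below builds a fully independent distribution of the form of Equation 2.1 for the six items, with uniform factors chosen purely for illustration (the paper does not prescribe them), and evaluates two first-order marginals for Corn.

```python
from itertools import permutations

# Hypothetical factors: vegetables (items 1-2) are confined to ranks {1, 2},
# fruits (items 3-6) to ranks {3, 4, 5, 6}; both factors are uniform here.
f = {veg: 1 / 2 for veg in permutations([1, 2])}
g = {fruit: 1 / 24 for fruit in permutations([3, 4, 5, 6])}

# Equation 2.1: h(sigma) = f(sigma(1), sigma(2)) * g(sigma(3), ..., sigma(6)).
h = {veg + fruit: f[veg] * g[fruit] for veg in f for fruit in g}

def marginal(h, item, rank):
    """h(sigma : sigma(item) = rank), with 1-indexed items and ranks."""
    return sum(p for sigma, p in h.items() if sigma[item - 1] == rank)

corn_second = marginal(h, item=1, rank=2)
corn_third = marginal(h, item=1, rank=3)
print(len(h), corn_second, corn_third)
```

Only 2! x 4! = 48 of the 720 rankings receive any mass, and no vegetable can ever appear in third place, whatever the factors f and g are.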

Huang et al. [2009] refer to this restrictive constraint as the first-order
condition because of the block structure imposed upon first-order marginals (see
Figure 2). In sports tracking, permutations represent the mapping between 
the identities of players with positions on the field, and in such settings, the 
first-order condition might say, quite reasonably, that there is potential iden- 
tity confusion within tracks for the red team and within tracks for the blue 
team but no confusion between the two teams. In our ranking example how- 
ever, the first-order condition forces the probability of any vegetable being 
in third place to be zero, even though both vegetables will, in general, have 
nonzero marginal probability of being in second place, which seems quite 
unrealistic. 

Example 5 (APA election data (continued)). Consider approximating 
the APA vote distribution by a factorized distribution (as in Equation 2.1). 
In Figure 3, we plot (in solid purple) the factored distribution which is closest 
to the true distribution with respect to total variation distance. In our ap- 
proximation, candidate 3 is constrained to be independent of the remaining 
four candidates and maps to rank 1 with probability 1. 

While capturing the fact that the "winner" of the election should be
candidate 3, the fully factored distribution can be seen to be a poor approximation,
assigning zero probability to most permutations even though all permutations
received a positive number of votes. Since the support of the true distribution
is not contained within the support of the approximation, the KL divergence,
$D_{KL}(h_{true}; h_{approx})$, is infinite.
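The support argument can be illustrated with a small sketch. The function `kl_divergence` and the two toy distributions are assumptions made for illustration; they are not the APA data.

```python
import math

def kl_divergence(p, q):
    """D_KL(p; q) = sum_x p(x) log(p(x) / q(x)).

    Returns infinity as soon as p puts mass on an outcome where q has none.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Toy distributions on rankings of three items.
h_true = {(1, 2, 3): 0.5, (2, 1, 3): 0.3, (3, 1, 2): 0.2}
h_approx = {(1, 2, 3): 0.6, (2, 1, 3): 0.4}   # misses the ranking (3, 1, 2)

print(kl_divergence(h_true, h_approx))   # support of h_true is not covered
print(kl_divergence(h_approx, h_true))   # finite in the other direction
```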

In the next section, we overcome the restrictive first-order condition with 
the more flexible notion of riffled independence. 

3. Riffled independence: definitions and examples. The riffle (or
dovetail) shuffle [Bayer and Diaconis, 1992] is perhaps the most commonly
used method of card shuffling, in which one cuts a deck of n cards into two
piles, A = {1, ..., p} and B = {p+1, ..., n}, with size p and q = n - p,
respectively, and successively drops the cards, one by one, so that the two
piles become interleaved (see Figure 4(a)) into a single deck again. Inspired
by the riffle shuffle, we present a novel relaxation of the full independence
assumption, which we call riffled independence. Rankings that are riffle
independent are formed by independently selecting rankings for two disjoint
subsets of objects, then interleaving the two rankings using a riffle shuffle to
form a final ranking over all objects. Intuitively, riffled independence models
complex relationships within each set A and B while allowing correlations
between the sets to be modeled only through a constrained form of shuffling.

Fig 4. (a) Photograph of the riffle shuffle executed on a standard deck of cards; (b) Pictorial
example of a (2, 4)-interleaving distribution, with red cards (offset to the left) denoting
Vegetables, and blue cards (offset to the right) denoting Fruits. Labels in panel (b):
"(2,4)-interleavings"; "Probability of [[Fruit, Vegetable, Vegetable, Fruit, Vegetable,
Vegetable]] interleaving".

Example 6. Consider generating a ranking of vegetables and fruits. We
might first "cut the deck" into two piles, a pile of vegetables (A) and a pile
of fruits (B), and in a first stage, independently decide how to rank each
pile. For example, within vegetables, we might decide that Peas are preferred
to Corn: [[P, C]] = [[Peas, Corn]]. Similarly, within fruits, we might decide
on the following ranking: [[L, F, G, O]] = [[Lemons, Figs, Grapes, Oranges]]
(Lemons preferred over Figs, Figs preferred over Grapes, Grapes preferred
over Oranges).

In the second stage of our model, the fruit and vegetable rankings are
interleaved to form a full preference ranking over all six items. For example,
if the interleaving is given by [[Veg, Fruit, Fruit, Fruit, Veg, Fruit]], then
the resulting full ranking is:

$\sigma$ = [[Peas, Lemons, Figs, Grapes, Corn, Oranges]].
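The two-stage construction of this example can be sketched directly in ordering notation. The helper `interleave` and its "V"/"F" pattern encoding are our own illustrative choices.

```python
def interleave(pattern, veg_order, fruit_order):
    """Merge two relative orderings according to an interleaving pattern.

    pattern lists, from most to least preferred, which pile each slot is
    drawn from ("V" for vegetables, "F" for fruits).
    """
    piles = {"V": iter(veg_order), "F": iter(fruit_order)}
    return [next(piles[slot]) for slot in pattern]

veg_order = ["Peas", "Corn"]                             # first stage, pile A
fruit_order = ["Lemons", "Figs", "Grapes", "Oranges"]    # first stage, pile B
full = interleave("VFFFVF", veg_order, fruit_order)      # second stage
print(full)
```

Whatever pattern is chosen, Peas stays ahead of Corn and the fruits keep their relative order; only the way the two piles are merged changes.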

3.1. Convolution based definition of riffled independence. There are two
ways to define riffled independence, and we will first provide a definition
using convolutions, a view inspired by our card shuffling intuitions.
Mathematically, shuffles are modeled as random walks on the symmetric group. The
ranking $\sigma'$ after a shuffle is generated from the ranking prior to that shuffle,
$\sigma$, by drawing a permutation $\tau$ from an interleaving distribution $m(\tau)$, and
setting $\sigma' = \tau\sigma$ (the composition of the mapping $\tau$ with $\sigma$). Given the
distribution $h'$ over $\sigma$, we can find the distribution $h(\sigma')$ after the shuffle via the
formula: $h(\sigma') = \sum_{\{(\sigma, \tau) : \sigma' = \tau\sigma\}} m(\tau) h'(\sigma)$. This operation which combines
the distributions m and h' is commonly known as convolution:

Definition 7. Let m and h' be probability distributions on $S_n$. The
convolution of the distributions is the function:
$[m * h'](\sigma) = \sum_{\pi \in S_n} m(\pi) \cdot h'(\pi^{-1}\sigma)$.
We use the * symbol to denote the convolution operation. Note
that * is not in general commutative (hence, $m * h' \neq h' * m$).
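A direct, brute-force reading of the convolution formula is sketched below for small n. The dictionary representation and the helper names are assumptions made for illustration; this is not the Fourier-domain procedure the paper develops later. Two point masses on $S_3$ are enough to show that the operation is not commutative.

```python
from itertools import permutations

def compose(tau, sigma):
    """Return the composition i -> tau(sigma(i)) on 1-indexed tuples."""
    return tuple(tau[s - 1] for s in sigma)

def inverse(pi):
    """Return the inverse permutation of pi."""
    inv = [0] * len(pi)
    for i, image in enumerate(pi, start=1):
        inv[image - 1] = i
    return tuple(inv)

def convolve(m, h, n):
    """[m * h](sigma) = sum over pi of m(pi) * h(pi^{-1} sigma)."""
    return {
        sigma: sum(w * h.get(compose(inverse(pi), sigma), 0.0)
                   for pi, w in m.items())
        for sigma in permutations(range(1, n + 1))
    }

# Point masses at two transpositions of S_3 (illustrative choice).
m = {(2, 1, 3): 1.0}
h = {(1, 3, 2): 1.0}

mh = convolve(m, h, 3)
hm = convolve(h, m, 3)
print([s for s, p in mh.items() if p > 0], [s for s, p in hm.items() if p > 0])
```

The sum runs over the support of m for each of the n! rankings, so this version is only practical for very small n.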

Besides the riffle shuffle, there are a number of different shuffling strategies.
The pairwise shuffle, for example, simply selects two cards at random and
swaps them. The question then, is what are interleaving shuffling
distributions m that correspond to riffle shuffles? To answer this question, we use
the distinguishing property of the riffle shuffle, that, after cutting the deck
into two piles of size p and q = n - p, it must preserve the relative ranking
relations within each pile. Thus, if the i-th card appears above the j-th card
in one of the piles, then after shuffling, the i-th card remains above the j-th
card. In our example, relative rank preservation says that if Peas is preferred
over Corn prior to shuffling, it continues to be preferred over Corn after
shuffling. Any allowable riffle shuffling distribution must therefore assign zero
probability to permutations which do not preserve relative ranking relations.
As it turns out, the set of permutations which do preserve these relations
has a simple description.

Definition 8 (Interleaving distributions). The (p, q)-interleavings are
defined as the following set:

$\Omega_{p,q} = \{\tau \in S_n : \tau(1) < \tau(2) < \dots < \tau(p), \text{ and } \tau(p+1) < \tau(p+2) < \dots < \tau(n)\}$.

A distribution $m_{p,q}$ on $S_n$ is called an interleaving distribution if it assigns
nonzero probability only to elements in $\Omega_{p,q}$.

The (p, q)-interleavings can be shown to preserve relative ranking relations
within each of the subsets A = {1, ..., p} and B = {p+1, ..., n} upon
multiplication:

Lemma 9. Let $i, j \in A = \{1, \dots, p\}$ (or $i, j \in B = \{p+1, \dots, n\}$) and
let $\tau$ be any (p, q)-interleaving in $\Omega_{p,q}$. Then i < j if and only if $\tau(i) < \tau(j)$
(i.e., permutations in $\Omega_{p,q}$ preserve relative ranking relations).
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Example 10. In our vegetable/fruits example, In our vegetable/ fruits 
example, we have n — Q, p — 2 (two vegetables, four fruits). The set of 
{2,4:)-interleavings is: 



(1, 2, 3, 4, 5, 6), 
(2, 3, 1, 4, 5, 6), 
(3, 5, 1, 2, 4, 5), 



(1, 3, 2, 4, 5, 6), 
(2, 4, 1,3, 5, 6), 
(3, 6, 1, 2, 4, 5), 



or written in ordering notation. 



^2,4 



IVVFFFF], 
IFVVFFF], 
IFFVFVF], 



IVFVFFF], 
IFVFVFF], 
IFFVFFV], 



(1,4, 2, 3, 5, 6), 
(2, 5, 1,3, 4, 6), 
(4, 5, 1, 2, 3, 6), 



IVFFVFF], 

IFVFFVF], 
IFFFVVF], 



(1, 5, 2, 3, 4, 6), 
(2, 6, 1, 3, 4, 5), 
(4, 6, 1, 2, 3, 5), 



IVFFFVF], 

IFVFFFV], 
IFFFVFV], 



(1, 6, 2, 3, 4, 5), 
(3, 4, 1, 2, 5, 6), 
(5,6, 1,2,3,4) 



Note that the number of possible interleavings is \Qp 



IVFFFFV], ] 
IFFVVFF], I 
IFFFFVV] J 

= O = O 



6!/(2!4!) = 15. One possible riffle shuffling distribution on Sq might, for 
example, assign uniform probability (rri^^l^ {a) — 1/15^) to each permuta- 
tion in fi2,4 Gind zero probability to everything else, reflecting indifference 
between vegetables and fruits. Figure 4(b) is a graphical example of a (2,4)- 
interleaving distribution. 
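
The set in Example 10 can be enumerated mechanically. The following is a minimal Python sketch, not code from the paper: the function names `interleavings` and `ordering_notation` are hypothetical, and rankings are stored as tuples whose i-th entry is the rank of item i.

```python
from itertools import combinations
from math import comb

def interleavings(p, q):
    """Return all (p, q)-interleavings in ranking notation.

    tau[i - 1] is the rank of item i; items 1..p form A, items p+1..n form B.
    """
    n = p + q
    result = []
    for a_ranks in combinations(range(1, n + 1), p):   # increasing ranks for A
        taken = set(a_ranks)
        b_ranks = tuple(r for r in range(1, n + 1) if r not in taken)
        result.append(a_ranks + b_ranks)               # B ranks are increasing too
    return result

def ordering_notation(tau, p):
    """Write an interleaving as a string of V (item in A) and F (item in B)."""
    slots = [""] * len(tau)
    for item, rank in enumerate(tau, start=1):
        slots[rank - 1] = "V" if item <= p else "F"
    return "".join(slots)

omega = interleavings(2, 4)
print(len(omega), comb(6, 2))
for tau in omega:
    print(tau, ordering_notation(tau, 2))
```

Choosing the p ranks occupied by A determines the whole interleaving, which is why the count equals the binomial coefficient.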

We now formally define our generalization of independence where a distri- 
bution which fully factors independently is allowed to undergo a single riffle 
shuffle. 



Definition 11 (Riffled independence). The subsets A = {1, . . . , p} and
B = {p + 1, . . . , n} are said to be riffle independent if
$h = m_{p,q} * (f_A(\sigma(A)) \cdot g_B(\sigma(B)))$, with respect to some interleaving
distribution $m_{p,q}$ and distributions $f_A$, $g_B$, respectively. We will notate
the riffled independence relation as $A \perp_m B$, and refer to $f_A$, $g_B$ as
relative ranking factors.

Notice that without the additional convolution, the definition of riffled 
independence reduces to the fully independent case given by Equation 2.1. 

Example 12. Consider drawing a ranking from a riffle independent model.
One starts with two piles of cards, A and B, stacked together in a deck. In
our fruits/vegetables setting, if we always prefer vegetables to fruits, then
the vegetables occupy positions {1, 2} and the fruits occupy positions
{3, 4, 5, 6}. In the first step, rankings of each pile are drawn independently.
For example, we might have the rankings $\sigma(Veg) = (2, 1)$ and
$\sigma(Fruit) = (4, 6, 5, 3)$, constituting a draw from the fully independent
model described in Section 2.2. In the second stage, the deck of cards is cut
and interleaved by an independently selected element $\tau \in \Omega_{2,4}$. For
example, if:

$\tau$ = (2, 3, 1, 4, 5, 6) = [Fruit, Veg, Veg, Fruit, Fruit, Fruit],






then the joint ranking is:

$\tau(\sigma(Veg), \sigma(Fruit))$ = (2, 3, 1, 4, 5, 6)(2, 1, 4, 6, 5, 3) = (3, 2, 4, 6, 5, 1)
                       = [Grapes, Peas, Corn, Lemon, Fig, Orange].
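
The composition in Example 12 can be reproduced in a few lines. This is an illustrative Python sketch; the item indexing (Corn = 1, Peas = 2, Lemon = 3, Orange = 4, Fig = 5, Grapes = 6) is an assumption inferred from the example's result, and `compose` and `to_ordering` are hypothetical helper names.

```python
def compose(tau, sigma):
    """Return the ranking tau∘sigma, in which item i gets rank tau(sigma(i))."""
    return tuple(tau[r - 1] for r in sigma)

def to_ordering(ranking, names):
    """Convert ranking notation into an ordered list of item names."""
    ordered = [None] * len(ranking)
    for name, rank in zip(names, ranking):
        ordered[rank - 1] = name
    return ordered

# Assumed item indexing (not stated explicitly in this section).
ITEMS = ["Corn", "Peas", "Lemon", "Orange", "Fig", "Grapes"]

sigma_veg = (2, 1)            # vegetables occupy ranks {1, 2}
sigma_fruit = (4, 6, 5, 3)    # fruits occupy ranks {3, 4, 5, 6}
tau = (2, 3, 1, 4, 5, 6)      # interleaving [Fruit, Veg, Veg, Fruit, Fruit, Fruit]

joint = compose(tau, sigma_veg + sigma_fruit)
print(joint)                      # (3, 2, 4, 6, 5, 1)
print(to_ordering(joint, ITEMS))  # Grapes, Peas, Corn, Lemon, Fig, Orange
```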

3.2. Alternative definition of riffled independence. It is possible to rewrite 
the definition of riffled independence so that it does not involve a convolu- 
tion. We first define functions which map a given full ranking to relative 
rankings and interleavings for A and B. 

Definition 13. 

• (Absolute ranks): Given a ranking $\sigma \in S_n$ and a subset A ⊂ {1, . . . , n},
$\sigma(A)$ denotes the absolute ranks of items in A.

• (Relative ranking map): Let $\phi_A(\sigma)$ denote the ranks of items in A
relative to the set A. For example, in the ranking $\sigma$ = [P, L, F, G, C, O],
the relative ranks of the vegetables are $\phi_A(\sigma)$ = [P, C] = [Peas, Corn].
Thus, while corn is ranked fifth in $\sigma$, it is ranked second in $\phi_A(\sigma)$.
Similarly, the relative ranks of the fruits are $\phi_B(\sigma)$ = [L, F, G, O] =
[Lemons, Figs, Grapes, Oranges].

• (Interleaving map): Likewise, let $\tau_{A,B}(\sigma)$ denote the way in which
the sets A and B are interleaved by $\sigma$. For example, using the same
$\sigma$ as above, the interleaving of vegetables and fruits is $\tau_{A,B}(\sigma)$ =
[Veg, Fruit, Fruit, Fruit, Veg, Fruit]. In ranking notation (as opposed
to ordering notation), $\tau_{A,B}$ can be written as (sort($\sigma(A)$), sort($\sigma(B)$)).
Note that for every possible interleaving $\tau \in \Omega_{p,q}$, there are exactly
p! × q! distinct permutations which are associated to $\tau$ by the
interleaving map.

Using the above maps, the following lemma provides an algebraic expres- 
sion for how any permutation a can be uniquely decomposed into an in- 
terleaving composed with relative rankings of A and B, which have been 
"stacked" into one deck. 

Lemma 14. Let A = {1, . . . , p}, and B = {p + 1, . . . , n}. Any ranking
$\sigma \in S_n$ can be decomposed uniquely as an interleaving $\tau \in \Omega_{p,q}$ composed
with a ranking of the form $(\pi_p, \pi_q + p)$, where $\pi_p \in S_p$, $\pi_q \in S_q$, and
$\pi_q + p$ means that the number p is added to every rank in $\pi_q$. Specifically,
$\sigma = \tau(\pi_p, \pi_q + p)$ with $\tau = \tau_{A,B}(\sigma)$, $\pi_p = \phi_A(\sigma)$, and
$\pi_q = \phi_B(\sigma)$ (Proof in Appendix).
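
The maps of Definition 13 and the decomposition of Lemma 14 can be checked by brute force for small n. The sketch below is illustrative only: `relative_ranking`, `decompose` and `recompose` are hypothetical names, and the example ranking assumes the item indexing Corn = 1, Peas = 2, Lemon = 3, Orange = 4, Fig = 5, Grapes = 6.

```python
from itertools import permutations

def relative_ranking(ranks):
    """Re-rank a tuple of absolute ranks as 1..k, preserving their order."""
    order = sorted(ranks)
    return tuple(order.index(r) + 1 for r in ranks)

def decompose(sigma, p):
    """Return (tau, pi_p, pi_q) for A = {1..p} and B = {p+1..n}."""
    tau = tuple(sorted(sigma[:p])) + tuple(sorted(sigma[p:]))  # interleaving map
    return tau, relative_ranking(sigma[:p]), relative_ranking(sigma[p:])

def recompose(tau, pi_p, pi_q):
    """Compute tau(pi_p, pi_q + p): stack the relative rankings, then interleave."""
    p = len(pi_p)
    stacked = pi_p + tuple(r + p for r in pi_q)
    return tuple(tau[r - 1] for r in stacked)

# sigma = [P, L, F, G, C, O] in ordering notation, under the assumed indexing.
sigma = (5, 1, 2, 6, 3, 4)
tau, pi_p, pi_q = decompose(sigma, 2)
print(tau, pi_p, pi_q)

# Uniqueness: the 5! rankings of S_5 map to 5! distinct coordinate triplets.
triplets = {decompose(s, 2) for s in permutations(range(1, 6))}
print(len(triplets))
```

Here `tau` corresponds to [Veg, Fruit, Fruit, Fruit, Veg, Fruit], `pi_p` to [Peas, Corn] and `pi_q` to [Lemons, Figs, Grapes, Oranges].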

Lemma 14 shows that one can think of a triplet
($\tau \in \Omega_{p,q}$, $\sigma_p \in S_p$, $\sigma_q \in S_q$) as being coordinates which uniquely
specify any ranking of items in A ∪ B. Using the decomposition, we can now
state a second, perhaps more intuitive, definition of riffled independence in
terms of the relative ranking and interleaving maps.

Definition 15. Sets A and B are said to be riffle independent if and
only if, for every $\sigma \in S_n$, the joint distribution h factors as:

(3.1) $h(\sigma) = m(\tau_{A,B}(\sigma)) \cdot f_A(\phi_A(\sigma)) \cdot g_B(\phi_B(\sigma)).$

Proposition 16. Definitions 11 and 15 are equivalent. 

Proof. Assume that A = {1, . . . , p} and B = {p + 1, . . . , n} are riffle
independent with respect to Definition 11. We will show that Definition 15
is also satisfied (the opposite direction will be similar). Therefore, we assume
that $h = m_{p,q} * (f(\sigma_A) \cdot g(\sigma_B))$. Note that $f(\sigma_A) \cdot g(\sigma_B)$ is supported
on the subgroup $S_p \times S_q = \{\sigma \in S_n : 1 \le \sigma(i) \le p \text{ whenever } 1 \le i \le p\}$.

Let $\sigma = (\sigma_A, \sigma_B)$ be any ranking. We will need to use a simple claim:
consider the ranking $\tau^{-1}\sigma$ (where $\tau \in \Omega_{p,q}$). Then $\tau^{-1}\sigma$ is an element
of the subgroup $S_p \times S_q$ if and only if $\tau = \tau_{A,B}(\sigma)$.

$h(\sigma) = \sum_{\tau \in \Omega_{p,q}} m_{p,q}(\tau) \cdot [f \cdot g](\tau^{-1}\sigma)$, (since $m_{p,q}$ is supported on $\Omega_{p,q}$)
     $= m_{p,q}(\tau_{A,B}(\sigma)) \cdot [f \cdot g]((\tau_{A,B}(\sigma))^{-1}\sigma)$,
       (by the claim above and since $f \cdot g$ is supported on $S_p \times S_q$)
     $= m_{p,q}(\tau_{A,B}(\sigma)) \cdot [f \cdot g](\phi_A(\sigma), \phi_B(\sigma))$, (by Lemma 14)
     $= m_{p,q}(\tau_{A,B}(\sigma)) \cdot f(\phi_A(\sigma)) \cdot g(\phi_B(\sigma))$. (by independence of $f \cdot g$)

Thus, we have shown that Definition 15 has been satisfied as well. □
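
Proposition 16 can also be sanity-checked numerically on a small example. The sketch below is not from the paper: it builds random factors for p = q = 2 (an arbitrary choice), evaluates h once through the convolution of Definition 11 and once through the factorization of Definition 15, and all helper names are hypothetical.

```python
import random
from itertools import combinations, permutations

def compose(tau, sigma):
    """(tau sigma)(i) = tau(sigma(i)); rankings are 1-indexed tuples."""
    return tuple(tau[r - 1] for r in sigma)

def inverse(sigma):
    inv = [0] * len(sigma)
    for item, rank in enumerate(sigma, start=1):
        inv[rank - 1] = item
    return tuple(inv)

def relative_ranking(ranks):
    order = sorted(ranks)
    return tuple(order.index(r) + 1 for r in ranks)

def random_dist(support, rng):
    weights = [rng.random() for _ in support]
    total = sum(weights)
    return {s: w / total for s, w in zip(support, weights)}

p, q = 2, 2
n = p + q
rng = random.Random(0)
omega = [a + tuple(r for r in range(1, n + 1) if r not in a)
         for a in combinations(range(1, n + 1), p)]
m = random_dist(omega, rng)                                  # interleaving distribution
f = random_dist(list(permutations(range(1, p + 1))), rng)    # factor for A
g = random_dist(list(permutations(range(1, q + 1))), rng)    # factor for B

def fg(sigma):
    """Fully independent product f.g, supported on the subgroup S_p x S_q."""
    if set(sigma[:p]) != set(range(1, p + 1)):
        return 0.0
    return f[sigma[:p]] * g[tuple(r - p for r in sigma[p:])]

def h_convolution(sigma):
    """Definition 11: h = m * (f.g), summing over the support of m."""
    return sum(m[tau] * fg(compose(inverse(tau), sigma)) for tau in omega)

def h_factored(sigma):
    """Definition 15: h(sigma) = m(tau) f(phi_A) g(phi_B)."""
    tau = tuple(sorted(sigma[:p])) + tuple(sorted(sigma[p:]))
    return m[tau] * f[relative_ranking(sigma[:p])] * g[relative_ranking(sigma[p:])]

gap = max(abs(h_convolution(s) - h_factored(s))
          for s in permutations(range(1, n + 1)))
print(gap)
```

The factored form touches one entry of each factor per ranking, whereas the convolution sums over every interleaving.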

Discussion. We have presented two ways of thinking about riffled
independence. Our first formulation, in terms of convolution, is motivated by
the connections between riffled independence and card shuffling theory. As we
show in Section 5, the convolution based view is also crucial for working
with Fourier coefficients of riffle independent distributions and analyzing the
theoretical properties of riffled independence. Our second formulation, on the
other hand, shows the concept of riffled independence to be remarkably simple:
the probability of a single ranking can be computed without summing over all
rankings (required in convolution), a fact which may not have been obvious
from Definition 11.

Finally, for interested readers, the concept of riffled independence also
has a simple and natural group theoretic description. By a fully factorized
distribution, we refer to a distribution supported on the subgroup $S_p \times S_q$,
which factors along the $S_p$ and $S_q$ "dimensions". As we have discussed, such
sparse distributions are not appropriate for ranking applications, and one
would like to work with distributions capable of placing nonzero probability
mass on all rankings. In the case of the symmetric group, however, there
is a third "missing dimension": the coset space, $S_n/(S_p \times S_q)$. Thus, the
natural extension of full independence is to randomize over a set of coset
representatives of $S_p \times S_q$, what we have referred to in the above discussion
as interleavings. The draws from each set, $S_p$, $S_q$, and $S_n/(S_p \times S_q)$, are
then independent in the ordinary sense, and we say that the item sets A and B
are riffle independent.

Special cases. There are a number of special case distributions captured 
by the riffled independence model that are useful for honing intuition. We 
discuss these extreme cases in the following list. 

• (Uniform and delta distributions): Setting the interleaving distribution
and both relative ranking factors to be uniform distributions yields the
uniform distribution over all full rankings. Similarly, setting the same
distributions to be delta distributions (which assign zero probability
to all rankings but one) always yields a delta distribution.

It is interesting to note that while A and B are always fully independent
under a delta distribution, they are never independent under a uniform
distribution. However, both uniform and delta distributions factor riffle
independently with respect to any partitioning of the item set. Thus,
not only is A = {1, . . . , p} riffle independent of B = {p + 1, . . . , n},
but in fact, any set A is riffle independent of its complement.

• (Uniform interleaving distributions): Setting the interleaving
distribution to be uniform, as we will discuss in more detail later, reflects
complete indifference between the sets A and B, even if f and g encode
complex preferences within each set alone.

• (Uniform relative ranking factors): Setting the relative ranking factors,
f and g, to be uniform distributions means that with respect to the joint
distribution h, all items in A are completely interchangeable amongst
each other (as are all items in B).

• (Delta interleaving distributions): Setting the interleaving distribution,
$m_{p,q}$, to be a delta distribution on any of the (p, q)-interleavings in
$\Omega_{p,q}$ recovers the definition of ordinary probabilistic independence, and
thus riffled independence is a strict generalization thereof (see Figure 2).
Just as in the full independence regime, where the distributions f and
g are marginal distributions of absolute rankings of A and B, in the
riffled independence regime, f and g can be thought of as marginal
distributions of the relative rankings of item sets A and B.

• (Delta relative ranking factor): On the other hand, if one of the relative
ranking factors, say f, is a delta distribution and the other two
distributions $m_{p,q}$ and g are uniform, then the resulting riffle independent
distribution h can be thought of as an indicator function for the set
of rankings that are consistent with one particular incomplete ranking
(in which only the relative ranking of A has been specified). Such
distributions can be useful in practice when the input data comes in the
form of incomplete rankings rather than full rankings.

Fig 5. Approximating the APA vote distribution by riffle independent
distributions. (a) approximate distribution when candidate 2 is riffle
independent of remaining candidates; (c) approximate distribution when
candidate 3 is riffle independent of remaining candidates; (b) and (d)
corresponding first order marginals of each approximate distribution.

Example 17 (APA election data (continued)). Like the independence 
assumptions commonly used in naive Bayes models, we would rarely expect 
riffled independence to exactly hold in real data. Instead, it is more appro- 
priate to view riffled independence assumptions as a form of model bias that 
ensures learnability for small sample sizes, which as we have indicated, is 
almost always the case for distributions over rankings.

DrawRiffleUnif(p, q, n)                  // (p + q = n)
    with prob q/n                        // drop from right pile
        $\sigma^-$ ← DrawRiffleUnif(p, q − 1, n − 1)
        foreach i do $\sigma(i)$ ← $\sigma^-(i)$ if i < n;  n if i = n
    otherwise                            // drop from left pile
        $\sigma^-$ ← DrawRiffleUnif(p − 1, q, n − 1)
        foreach i do $\sigma(i)$ ← $\sigma^-(i)$ if i < p;  n if i = p;  $\sigma^-(i − 1)$ if i > p
    return $\sigma$

Algorithm 1: Recurrence for drawing $\sigma \sim m^{unif}_{p,q}$ (Base case: return
$\sigma$ = [1] if n = 1).



Can we ever expect riffled independence to be manifested in a real dataset?
In Figure 5(a), we plot (in dotted red) a riffle independent approximation to
the true APA vote distribution (in thick gray) which is optimal with respect
to KL-divergence (we will explain how to obtain the approximation in the
remainder of the paper). The approximation in Figure 5(a) is obtained by
assuming that the candidate set {1, 3, 4, 5} is riffle independent of {2}, and as
can be seen, is quite accurate compared to the truth (with the KL-divergence
from the true to the factored distribution being $D_{KL}$ = .0398). Figure 5(b)
exhibits the first order marginals of the approximating distribution, which can
also visually be seen to be a faithful approximation (see Figure 1(b)). We will
discuss the interpretation of the result further in Section 6.

For comparison, we also display (in Figures 5(c) and 5(d)) the result of
approximating the true distribution by one in which candidate {3}, the winner,
is riffle independent of the remaining candidates. The resulting approximation
is inferior, and the lesson to be learned in the example is that finding the
correct/optimal partitioning of the item set is important in practice. We
remark, however, that the approximation obtained by factoring out candidate 3
is not a terrible approximation (especially on examining first order
marginals), and that both approximations are far more accurate than the fully
independent approximation shown earlier in Figure 3. The KL divergence from the
true distribution to the factored distribution (with candidate 3 riffle
independent of the remaining candidates) is $D_{KL}$ = .0841.

3.3. Interleaving distributions. There is, in the general case, a significant
increase in storage required for riffled independence over full independence.
In addition to the O(p! + q!) storage required for distributions f and g,
we now require $O(\binom{n}{p})$ storage for the nonzero terms of the riffle shuffling
distribution $m_{p,q}$. We now introduce a family of useful riffle shuffling
distributions which can be described using only a handful of parameters. The
simplest riffle shuffling distribution is the uniform riffle shuffle,
$m^{unif}_{p,q}$, which assigns uniform probability to all (p, q)-interleavings and
zero probability to all other elements in $S_n$. Used in the context of riffled
independence, $m^{unif}_{p,q}$ models potentially complex relations within A and B,
but only captures the simplest possible correlations across subsets. We might,
for example, have complex preference relations amongst vegetables and amongst
fruits, but be completely indifferent with respect to the subsets, vegetables
and fruits, as a whole.

Fig 6. First-order matrices with a deck of 20 cards, A = {1, . . . , 10},
B = {11, . . . , 20}, riffle independent and various settings of $\alpha$:
(a) $\alpha$ = 0, (b) $\alpha$ = 1/6, (c) $\alpha$ = 1/3, (d) $\alpha$ = 1/2, (e) $\alpha$ = 2/3,
(f) $\alpha$ = 5/6, (g) $\alpha$ = 1. Compare these matrices to the fully independent
first order marginal matrices of Figure 2 and note that here, the nonzero
blocks are allowed to 'bleed' into zero regions. Setting $\alpha$ = 0 or 1, however,
recovers the fully independent case, where a subset of objects is preferred
over the other with probability one.

There is a simple recursive method for uniformly drawing (p, q)-interleavings.
Starting with a deck of n cards cut into a left pile ({1, . . . , p}) and a
right pile ({p + 1, . . . , n}), pick one of the piles with probability
proportional to its size (p/n for the left pile, q/n for the right) and drop
the bottommost card, thus mapping either card p or card n to rank n. Then
recurse on the n − 1 remaining undropped cards, drawing a (p − 1, q)-interleaving
if the left pile was picked, or a (p, q − 1)-interleaving if the right pile
was picked. See Algorithm 1.
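
The recursion just described translates directly into code. The following Python sketch follows the recurrence of Algorithm 1 but is not the authors' implementation: the name `draw_riffle_unif` is hypothetical, n is derived from p and q instead of being passed, and the base case is taken to be the empty deck.

```python
import random
from collections import Counter

def draw_riffle_unif(p, q, rng=random):
    """Draw a (p, q)-interleaving uniformly at random, in ranking notation."""
    n = p + q
    if n == 0:
        return ()
    if rng.random() < q / n:
        # Drop the bottom card of the right pile: item n receives rank n.
        rest = draw_riffle_unif(p, q - 1, rng)
        return rest + (n,)
    # Drop the bottom card of the left pile: item p receives rank n, and
    # the items after it shift by one position relative to the recursive draw.
    rest = draw_riffle_unif(p - 1, q, rng)
    return rest[:p - 1] + (n,) + rest[p - 1:]

rng = random.Random(0)
counts = Counter(draw_riffle_unif(2, 2, rng) for _ in range(6000))
for tau, count in sorted(counts.items()):
    print(tau, count)
```

Every particular sequence of drops has probability p! q!/n!, so each of the $\binom{n}{p}$ interleavings is equally likely.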

It is natural to consider generalizations where one is preferentially biased
towards dropping cards from the left hand over the right hand (or vice versa).
We model this bias using a simple one-parameter family of distributions in
which cards from the left and right piles drop with probability proportional
to $\alpha p$ and $(1 - \alpha) q$, respectively, instead of p and q. We will refer to
$\alpha$ as the bias parameter, and the family of distributions parameterized by
$\alpha$ as the biased riffle shuffles.^

In the context of rankings, biased riffle shuffles provide a simple model for
expressing groupwise preferences (or indifference) for an entire subset A over
B or vice versa. The bias parameter $\alpha$ can be thought of as a knob
controlling the preference for one subset over the other, and might reflect,
for example, a preference for fruits over vegetables, or perhaps indifference
between the two subsets. Setting $\alpha$ = 0 or 1 recovers the full independence
assumption, preferring objects in A (vegetables) over objects in B (fruits)
with probability one (or vice versa), and setting $\alpha$ = .5 recovers the
uniform riffle shuffle (see Fig. 6). Finally, there are a number of
straightforward generalizations of the biased riffle shuffle that one can use
to realize richer distributions. For example, $\alpha$ might depend on the number
of cards that have been dropped from each pile (allowing, perhaps, for
distributions to prefer crunchy fruits over crunchy vegetables, but soft
vegetables over soft fruits).
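
A biased riffle shuffle changes only the probability of choosing a pile. The sketch below is a hedged illustration of the one-parameter family described above, with the hypothetical name `draw_riffle_biased`; the handling of an empty pile (always drop from the other one) is an assumption made so that the extreme settings $\alpha$ = 0 and $\alpha$ = 1 are well defined.

```python
import random

def draw_riffle_biased(p, q, alpha, rng=random):
    """Draw a (p, q)-interleaving with left/right drop weights alpha*p and (1-alpha)*q."""
    n = p + q
    if n == 0:
        return ()
    if p == 0:
        drop_left = False          # assumption: an empty pile is never chosen
    elif q == 0:
        drop_left = True
    else:
        left_w, right_w = alpha * p, (1 - alpha) * q
        drop_left = rng.random() < left_w / (left_w + right_w)
    if drop_left:
        rest = draw_riffle_biased(p - 1, q, alpha, rng)
        return rest[:p - 1] + (n,) + rest[p - 1:]   # item p receives rank n
    rest = draw_riffle_biased(p, q - 1, alpha, rng)
    return rest + (n,)                              # item n receives rank n

rng = random.Random(0)
print(draw_riffle_biased(2, 4, 0.0, rng))   # right pile always drops first: A on top
print(draw_riffle_biased(2, 4, 1.0, rng))   # left pile always drops first: B on top
print(draw_riffle_biased(2, 4, 0.5, rng))   # same drop probabilities as the uniform shuffle
```

With $\alpha$ = .5 the weights are proportional to p and q, so the draw coincides in distribution with the uniform riffle shuffle.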

4. Exploiting structure for probabilistic inference. In this sec- 
tion, we discuss a number of basic properties of riffled independence, which 
show that certain probabilistic inference operations can be accomplished by 
operating on a single factor rather than the entire joint distribution. 

Upon knowing that A is riffle independent of B, an immediate consequence
is that we can show, just as in the full independence regime, that
conditioning operations on certain observations and MAP (maximum a
posteriori) assignment problems decompose according to riffled independence
structure. All of the following properties are straightforward to derive using
the factorization in Definition 15.

Proposition 18 (Probabilistic inference decompositions).

• (Conditioning): Consider prior and likelihood functions, $h_{prior}$ and
$h_{like}$, on $S_n$ in which subsets A and B are riffle independent, with
parameters ($m_{prior}$, $f_{prior}$, $g_{prior}$) and ($m_{like}$, $f_{like}$, $g_{like}$),
respectively. Let $\odot$ denote the pointwise product operation between two
functions. Then A and B are also riffle independent with respect to the
posterior distribution under Bayes rule, which has interleaving distribution
$m_{like} \odot m_{prior}$ with relative ranking factors $f_{like} \odot f_{prior}$ and
$g_{like} \odot g_{prior}$, for A and B respectively.

• (MAP assignment): Let A and B be riffle independent subsets. Consider
the following permutations:

$\pi_p^* = \arg\max_{\pi \in S_p} f_A(\pi)$,  $\pi_q^* = \arg\max_{\pi \in S_q} g_B(\pi)$,  $\tau^* = \arg\max_{\tau \in \Omega_{p,q}} m_{p,q}(\tau)$.

Then the mode of h is $\tau^*$ composed with $\pi^* = (\pi_p^*, \pi_q^*)$ (i.e.,
$\arg\max_\sigma h(\sigma) = \tau^*\pi^*$).

• (Entropy): Consider riffle independent subsets A and B. The entropy
of the joint distribution is given by: $H[h] = H[m_{p,q}] + H[f_A] + H[g_B]$.

^The recurrence in Alg. 1 has appeared in various forms in literature [Bayer
and Diaconis, 1992]. We are the first to (1) use the recurrence to Fourier
transform $m_{p,q}$, and to (2) consider biased versions. The biased riffle
shuffles in Fulman [1998] are not similar to our biased riffle shuffles.
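
The MAP and entropy decompositions can be verified by brute force on a small riffle independent model. This is an illustrative sketch with randomly generated factors (p = 2, q = 3 chosen arbitrarily); none of the helper names come from the paper.

```python
import math
import random
from itertools import combinations, permutations

def compose(tau, sigma):
    """(tau sigma)(i) = tau(sigma(i)); rankings are 1-indexed tuples."""
    return tuple(tau[r - 1] for r in sigma)

def stack(pi_p, pi_q):
    """Form (pi_p, pi_q + p), an element of the subgroup S_p x S_q."""
    return pi_p + tuple(r + len(pi_p) for r in pi_q)

def random_dist(support, rng):
    weights = [rng.random() for _ in support]
    total = sum(weights)
    return {s: w / total for s, w in zip(support, weights)}

def entropy(dist):
    return -sum(x * math.log(x) for x in dist.values() if x > 0)

p, q = 2, 3
n = p + q
rng = random.Random(0)
omega = [a + tuple(r for r in range(1, n + 1) if r not in a)
         for a in combinations(range(1, n + 1), p)]
m = random_dist(omega, rng)
f = random_dist(list(permutations(range(1, p + 1))), rng)
g = random_dist(list(permutations(range(1, q + 1))), rng)

# Joint distribution over all n! rankings via the factorization of Definition 15.
h = {compose(tau, stack(pi_p, pi_q)): m[tau] * f[pi_p] * g[pi_q]
     for tau in m for pi_p in f for pi_q in g}

mode_brute = max(h, key=h.get)                       # search over n! rankings
mode_factored = compose(max(m, key=m.get),           # three small searches
                        stack(max(f, key=f.get), max(g, key=g.get)))
print(mode_brute, mode_factored)
print(entropy(h), entropy(m) + entropy(f) + entropy(g))
```

The factored search inspects $\binom{n}{p}$ + p! + q! values instead of n!.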

Some ranked datasets come in the form of pairwise comparisons, with
records of the form "object i is preferred to object j". As a corollary to
Proposition 18, we now argue that conditioning on these pairwise ranking
likelihood functions (that depend only on whether object i is preferred to
object j) decomposes along riffled independence structures. The pairwise
ranking model [Huang, Guestrin and Guibas, 2009b] for objects i and j is
defined over $S_n$ as:

$h_{like}(\sigma) = \delta_{\sigma(i)<\sigma(j)}(\sigma) = \begin{cases} \phi & \text{if } \sigma(i) < \sigma(j) \\ 1 - \phi & \text{otherwise} \end{cases}, \quad 0 \le \phi \le 1,$

and reflects the fact that object i is preferred to object j (with probability
$\phi$). If objects i and j both belong to one of the sets, say A, then only
one factor requires an update using Bayes rule. If vegetables and fruits are
riffle independent, for example, then less computation would be required to
compare a vegetable against a vegetable than to compare a fruit against
a vegetable. For example, the observation that Corn is preferred over Peas
affects only the distribution, $f_A$, over vegetables. More formally, we state
this intuitive corollary as follows:

Corollary 19. Consider conditioning on the pairwise ranking model,
$h_{like}(\sigma) = \delta_{\sigma(i)<\sigma(j)}(\sigma)$, and suppose that A and B are riffle
independent subsets with respect to the prior distribution $h_{prior}$, with
parameters ($m_{prior}$, $f_{prior}$, $g_{prior}$). If i, j ∈ A, then A and B are riffle
independent with respect to the posterior distribution, whose parameters are
identical to those of the prior, except for the relative ranking factor
corresponding to A, which is

$f_{post} = f_{prior} \odot \delta_{\sigma(i)<\sigma(j)}.$

Proof sketch. First show that the subsets A and B are riffle independent
with respect to the likelihood function, $\delta_{\sigma(i)<\sigma(j)}$, by equating the
likelihood function to a product of a uniform interleaving distribution,
$m^{unif}_{p,q}$, and relative ranking factor $f_A = \delta_{\sigma(i)<\sigma(j)}$ for A, and a
uniform relative ranking factor for B. Then apply Proposition 18. □

Let us compare the result of the corollary to what is possible with a fully
factored distribution. If A and B were fully independent, then conditioning
on any distribution which involved items only in A (or only in B) would require
updating only the factor associated with that item set. For example, if i ∈ A,
then first-order observations of the form "item i is in rank j" can be
efficiently conditioned in the fully independent scenario. With riffled
independence, it is not, in general, possible to condition on such first-order
observations without modifying all of the O(n!) parameters. However, as
Corollary 19 shows, pairwise comparisons involving i, j both in A (or both in
B) can be performed exactly by updating either f (or g) without having to
touch all O(n!) probabilities.
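
The update in Corollary 19 can be compared against brute-force Bayesian conditioning on a small example. The sketch below assumes the hard (probability-one) form of the pairwise observation and random prior factors; the names and the choice p = 3, q = 2 are illustrative only.

```python
import random
from itertools import combinations, permutations

def compose(tau, sigma):
    return tuple(tau[r - 1] for r in sigma)

def stack(pi_p, pi_q):
    return pi_p + tuple(r + len(pi_p) for r in pi_q)

def normalize(fn):
    total = sum(fn.values())
    return {k: v / total for k, v in fn.items()}

def random_dist(support, rng):
    return normalize({s: rng.random() for s in support})

def joint(m, f, g):
    """Riffle independent joint h(sigma) = m(tau) f(pi_p) g(pi_q)."""
    return {compose(tau, stack(a, b)): m[tau] * f[a] * g[b]
            for tau in m for a in f for b in g}

p, q = 3, 2
n = p + q
rng = random.Random(0)
omega = [a + tuple(r for r in range(1, n + 1) if r not in a)
         for a in combinations(range(1, n + 1), p)]
m = random_dist(omega, rng)
f = random_dist(list(permutations(range(1, p + 1))), rng)
g = random_dist(list(permutations(range(1, q + 1))), rng)

i, j = 1, 3   # observation: item i is preferred to item j, both in A

# Brute force: multiply all n! prior probabilities by the likelihood.
prior = joint(m, f, g)
posterior_brute = normalize({s: pr * (1.0 if s[i - 1] < s[j - 1] else 0.0)
                             for s, pr in prior.items()})

# Corollary 19: update only the p! entries of the factor for A.
f_post = normalize({a: pr * (1.0 if a[i - 1] < a[j - 1] else 0.0)
                    for a, pr in f.items()})
posterior_factored = joint(m, f_post, g)

gap = max(abs(posterior_brute[s] - posterior_factored[s]) for s in prior)
print(gap)
```

The two posteriors agree because an interleaving preserves the relative order of items within A (Lemma 9).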

5. Algorithms for a fixed partitioning of the item set. We have
thus far covered a number of intuitive examples and properties of riffled
independence. Given a set of rankings drawn from some distribution h, we
are now interested in estimating a number of statistical quantities, such as
the parameters of a riffle independent model. In this section, we will assume a
known structure (that the partitioning of the item set into subsets A and B is
known), and given such a partitioning of the item set, we are interested in the
problem of estimating parameters (which we will refer to as RiffleSplit), and
the inverse problem of computing probabilities (or marginal probabilities)
with given parameters (which we will refer to as RiffleJoin).

RiffleSplit. In RiffleSplit (which we will also refer to as the parameter
estimation problem), we would like to estimate various statistics of the
relative ranking and interleaving distributions of a riffle independent
distribution ($m_{p,q}$, $f_A$, and $g_B$). Given a set of i.i.d. training examples,
$\sigma^{(1)}, \ldots, \sigma^{(m)}$, we might, for example, want to estimate each raw
probability (e.g., estimate $m_{p,q}(\tau)$ for each interleaving $\tau$). In general,
we may be interested in estimating more general statistics (e.g., what are the
second order relative ranking probabilities of the set of fruits?).

Since our variables are discrete, computing the maximum likelihood
parameter estimates consists of forming counts of the number of training
examples consistent with a given interleaving or relative ranking. Thus, the
MLE parameters in our problem are simply given by the following formulas:

(5.1) $m^{MLE}_{p,q}(\tau) \propto \sum_{i=1}^{m} \mathbb{1}[\tau = \tau_{A,B}(\sigma^{(i)})],$

(5.2) $f^{MLE}_A(\pi_p) \propto \sum_{i=1}^{m} \mathbb{1}[\pi_p = \phi_A(\sigma^{(i)})],$

(5.3) $g^{MLE}_B(\pi_q) \propto \sum_{i=1}^{m} \mathbb{1}[\pi_q = \phi_B(\sigma^{(i)})].$
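
The counting estimates can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `riffle_split_mle` is a hypothetical name, A is assumed to be the first p items, and the four training rankings are made up for the example.

```python
from collections import Counter

def relative_ranking(ranks):
    """Re-rank a tuple of absolute ranks as 1..k, preserving their order."""
    order = sorted(ranks)
    return tuple(order.index(r) + 1 for r in ranks)

def riffle_split_mle(samples, p):
    """Return MLE estimates (m, f, g) by counting, with A = {1..p}."""
    m_counts, f_counts, g_counts = Counter(), Counter(), Counter()
    for sigma in samples:
        tau = tuple(sorted(sigma[:p])) + tuple(sorted(sigma[p:]))
        m_counts[tau] += 1                               # interleaving counts
        f_counts[relative_ranking(sigma[:p])] += 1       # relative ranking of A
        g_counts[relative_ranking(sigma[p:])] += 1       # relative ranking of B
    total = len(samples)
    return tuple({k: v / total for k, v in c.items()}
                 for c in (m_counts, f_counts, g_counts))

# Made-up training rankings over six items (two in A, four in B).
samples = [(5, 1, 2, 6, 3, 4), (3, 2, 4, 6, 5, 1),
           (1, 2, 3, 4, 5, 6), (2, 1, 3, 4, 5, 6)]
m_mle, f_mle, g_mle = riffle_split_mle(samples, 2)
print(m_mle)
print(f_mle)
print(g_mle)
```

Each training example contributes one count to each of the three factors, so the cost is linear in the number of examples.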



RiffleJoin. Having estimated parameters of a riffle independent
distribution, we would like to now compute various statistics of the data
itself. In the simplest case, we are interested in estimating $h(\sigma)$, the joint
probability of a single ranking, which can be evaluated simply by plugging
parameter estimates of $m_{p,q}$, $f_A$, and $g_B$ into our second definition of
riffled independence (Definition 15).

More generally, however, we may be interested in knowing the low-order
statistics of the data (e.g., the first order marginals, second order marginals,
etc.), or related statistics (such as $h(\sigma(i) < \sigma(j))$, the probability that
object i is preferred to object j). Typically, for such low-order statistics,
one must compute a sum over rankings. For example, to compute the probability
that item j is ranked in position i, one must sum over (n − 1)! rankings:

(5.4) $h(\sigma : \sigma(j) = i) = \sum_{\sigma : \sigma(j) = i} m_{p,q}(\tau_{A,B}(\sigma)) \cdot \left( f_A(\phi_A(\sigma)) \cdot g_B(\phi_B(\sigma)) \right).$

While Equation 5.4 may be feasible for small n (such as on the APA
dataset), the sum quickly grows to be intractable for larger n. One of the
main observations of the remainder of this section, however, is that
low-order marginal probabilities of the joint distribution can always be
computed directly from low-order marginal probabilities of the relative ranking
and interleaving distributions without explicitly computing intractable sums.

5.1. Fourier theoretic algorithms for riffled independence. We now present
algorithms for working with riffled independence (solving the RiffleSplit and
RiffleJoin problems) in the Fourier theoretic framework of Huang, Guestrin
and Guibas [2009b]; Huang et al. [2009]; Kondor, Howard and Jebara [2007].
The Fourier theoretic perspective of riffled independence presented here is
valuable because it will allow us to work directly with low-order statistics
instead of having to form the necessary raw probabilities first. Note that
readers who are primarily interested in the structure learning can jump
directly to Section 6.

We begin with a brief introduction to Fourier theoretic inference on
permutations (see Huang, Guestrin and Guibas [2009b]; Kondor [2008] for a
detailed exposition). Unlike its analog on the real line, the Fourier transform
of a function on $S_n$ takes the form of a collection of Fourier coefficient
matrices ordered with respect to frequency. Discussing the analog of frequency
for functions on $S_n$ is beyond the scope of our paper, and, given a
distribution h, we simply index the Fourier coefficient matrices of h as
$\hat{h}_0, \hat{h}_1, \ldots, \hat{h}_K$, ordered with respect to some measure of increasing
complexity. We use $\hat{h}$ to denote the complete collection of Fourier
coefficient matrices. One rough way to understand this complexity, as mentioned
in Section 2, is by the fact that the low-frequency Fourier coefficient
matrices of a distribution can be used to reconstruct low-order marginals. For
example, the first-order matrix of marginals of h can always be reconstructed
from the matrices $\hat{h}_0$ and $\hat{h}_1$. As on the real line, many of the familiar
properties of the Fourier transform continue to hold. The following are several
basic properties used in this paper:

Proposition 20 (Properties of the Fourier transform, Diaconis [1988]).
Consider any $f, g : S_n \to \mathbb{R}$.

• (Linearity) For any $\alpha, \beta \in \mathbb{R}$, $[\widehat{\alpha f + \beta g}]_i = \alpha \hat{f}_i + \beta \hat{g}_i$ holds at all
frequency levels i.

• (Convolution) The Fourier transform of a convolution is a product of
Fourier transforms: $[\widehat{f * g}]_i = \hat{f}_i \cdot \hat{g}_i$, for each frequency level i, where
the operation $\cdot$ is matrix multiplication.

• (Normalization) The first coefficient matrix, $\hat{f}_0$, is a scalar and equals
$\sum_{\sigma \in S_n} f(\sigma)$.
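
The convolution and normalization properties can be illustrated without the full machinery of irreducible representations. The sketch below uses the matrix of first-order sums (the transform at the permutation-matrix representation) as a stand-in; this is an assumption made for illustration, since the coefficient matrices $\hat{f}_i$ in the text are defined at irreducible representations. All names are hypothetical.

```python
import random
from itertools import permutations

def compose(tau, sigma):
    """(tau sigma)(i) = tau(sigma(i)); rankings are 1-indexed tuples."""
    return tuple(tau[r - 1] for r in sigma)

def random_function(n, rng):
    return {s: rng.random() for s in permutations(range(1, n + 1))}

def convolve(a, b):
    """[a * b](sigma) = sum over tau of a(tau) b(tau^{-1} sigma)."""
    out = {}
    for tau, x in a.items():
        for pi, y in b.items():
            s = compose(tau, pi)
            out[s] = out.get(s, 0.0) + x * y
    return out

def first_order(fn, n):
    """M[i][j] = sum of fn(sigma) over sigma sending item j+1 to rank i+1."""
    M = [[0.0] * n for _ in range(n)]
    for sigma, value in fn.items():
        for item, rank in enumerate(sigma):
            M[rank - 1][item] += value
    return M

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

n = 4
rng = random.Random(0)
a, b = random_function(n, rng), random_function(n, rng)
conv = convolve(a, b)

lhs = first_order(conv, n)                          # transform of the convolution
rhs = matmul(first_order(a, n), first_order(b, n))  # product of the transforms
gap = max(abs(lhs[r][c] - rhs[r][c]) for r in range(n) for c in range(n))
print(gap)
print(sum(conv.values()), sum(a.values()) * sum(b.values()))
```

The second printed line is the same statement at the trivial representation, where the coefficient is the scalar total mass.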

A number of papers in recent years (Huang, Guestrin and Guibas [2007,
2009b]; Huang et al. [2009]; Kondor, Howard and Jebara [2007]) have
considered approximating distributions over permutations using a truncated
(bandlimited) set of Fourier coefficients and have proposed inference
algorithms that operate on these Fourier coefficient matrices. For example, one
can perform generic marginalization, Markov chain prediction, and conditioning
operations using only Fourier coefficients without ever having to perform an
inverse Fourier transform.

In this section, we provide generalizations of the algorithms in Huang et al.
[2009] that tackle the RiffleJoin and RiffleSplit problems. We will assume,
without loss of generality, that A = {1, . . . , p} and B = {p + 1, . . . , n}
(this assumption will be discarded in later sections). Although we begin each
of the following discussions as if all of the Fourier coefficients are provided,
we will be especially interested in algorithms that work well in cases where
only a truncated set of Fourier coefficients are present, and where h is only
approximately riffle independent.

For both problems, we will rely on two Fourier domain algorithms
introduced in Huang et al. [2009], Join and Split, as subroutines. Given
independent factors $f : S_p \to \mathbb{R}$ and $g : S_q \to \mathbb{R}$, Join returns the joint
distribution $f \cdot g$. Conversely, given a distribution $h : S_n \to \mathbb{R}$, Split
computes f and g by marginalizing over $S_q$ or $S_p$, respectively. For example,
Split[h] returns a function defined on $S_p$, and
$\text{Split}[h](\sigma_p) = \sum_{\sigma_q \in S_q} h((\sigma_p, \sigma_q))$. We will overload the Join/Split
names to refer to both the ordinary and Fourier theoretic formulations of the
same procedures.

RiffleJoin($\hat{f}$, $\hat{g}$, $\hat{m}_{p,q}$)
    input : Fourier transforms of $f_A$, $g_B$, and $m_{p,q}$ ($\hat{f}$, $\hat{g}$, $\hat{m}_{p,q}$, respectively)
    output: Fourier transform of the joint distribution, $\hat{h}$
    $\hat{h}'$ ← Join($\hat{f}$, $\hat{g}$);
    foreach frequency level i do
        $\hat{h}_i$ ← $[\hat{m}_{p,q}]_i \cdot \hat{h}'_i$;
    end
    return $\hat{h}$;

Algorithm 2: Pseudocode for RiffleJoin

RiffleSplit($\hat{h}$)
    input : Fourier transform of the empirical joint distribution, $\hat{h}$
    output: Fourier transform of MLE estimates of $f_A$, $g_B$ ($\hat{f}$, $\hat{g}$)
    foreach frequency level i do
        $\hat{h}'_i$ ← $[\hat{m}^{unif}_{p,q}]_i^T \cdot \hat{h}_i$;
    end
    [$\hat{f}$, $\hat{g}$] ← Split($\hat{h}'$);
    Normalize $\hat{f}$ and $\hat{g}$;
    return $\hat{f}$, $\hat{g}$;

Algorithm 3: Pseudocode for RiffleSplit

5.2. RiffleJoin in the Fourier domain. Given the Fourier coefficients of
f, g, and m, we can compute the Fourier coefficients of h using
Definition 11 (our first definition) by applying the Join algorithm from Huang
et al. [2009] and the Convolution Theorem (Proposition 20), which tells us
that the Fourier transform of a convolution can be written as a pointwise
product of Fourier transforms. To compute the $\hat{h}_i$, the Fourier theoretic
formulation of the RiffleJoin algorithm simply calls the Join algorithm on
$\hat{f}$ and $\hat{g}$, and convolves the result by m (see Algorithm 2).

In general, it may be intractable to Fourier transform the riffle shuffling
distribution $m_{p,q}$. However, there are some cases in which $\hat{m}_{p,q}$ can be
computed. For example, if $m_{p,q}$ is computed directly from a set of training
examples, then one can simply compute the desired Fourier coefficients using
the definition of the Fourier transform given in Huang, Guestrin and Guibas
[2009b], which is tractable as long as the samples can be tractably stored
in memory. For the class of biased riffle shuffles that we discussed in
Section 3, one can also efficiently compute the low-frequency terms of
$\hat{m}_{p,q}$ by employing the recurrence relation in Algorithm 1. In particular,
Algorithm 1 expresses a biased riffle shuffle on $S_n$ as a linear combination
of biased riffle shuffles on $S_{n-1}$. By invoking linearity of the Fourier
transform (Proposition 20), one can efficiently compute $\hat{m}_{p,q}$ via a dynamic
programming approach quite reminiscent of Clausen's FFT (Fast Fourier
transform) algorithm [Clausen and Baum, 1993]. We describe our algorithm in
more detail in Appendix C. To the best of our knowledge, we are the first to
compute the Fourier transform of riffle shuffling distributions.

5.3. RiffleSplit in the Fourier domain. Given the Fourier coefficients of
a riffle independent distribution h, we would like to tease apart the factors.
In the following, we show how to recover the relative ranking distributions,
$f_A$ and $g_B$, and defer the problem of recovering the interleaving distribution
to Appendix C.

From the RiffleJoin algorithm, we saw that for each frequency level i,
$\hat{h}_i = [\hat{m}_{p,q}]_i \cdot [\widehat{f \cdot g}]_i$. The first solution to the splitting problem
that might come to mind is to perform a deconvolution by multiplying each
$\hat{h}_i$ term by the inverse of the matrix $[\hat{m}_{p,q}]_i$ (to form
$[\hat{m}_{p,q}]_i^{-1} \cdot \hat{h}_i$) and call the Split algorithm from Huang et al. [2009] on
the result. Unfortunately, the matrix $[\hat{m}_{p,q}]_i$ is, in general,
non-invertible. Instead, our RiffleSplit algorithm left-multiplies each
$\hat{h}_i$ term by $[\hat{m}^{unif}_{p,q}]_i^T$, which can be shown to be equivalent to
convolving the distribution h by the 'dual shuffle', $m^*$, defined as
$m^*(\sigma) = m^{unif}_{p,q}(\sigma^{-1})$. While convolving by $m^*$ does not produce a
distribution that factors independently, the Split algorithm from Huang et al.
[2009] can still be shown to recover the Fourier transforms $\hat{f}^{MLE}$ and
$\hat{g}^{MLE}$ of the maximum likelihood parameter estimates:

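Theorem 21. Given a set of rankings with empirical distribution $\tilde{h}$,
the maximum likelihood estimates of the relative ranking distributions over
item sets $A$ and $B$ are given by:

(5.5)   $[f^{MLE}, g^{MLE}] \propto \mathrm{Split}\left[ m^{unif*}_{p,q} * \tilde{h} \right]$,

where $m^{unif*}_{p,q}$ is the dual shuffle (of the uniform interleaving
distribution). Furthermore, the Fourier transforms of the relative ranking
distributions are:

$[\hat{f}^{MLE}, \hat{g}^{MLE}]_i \propto \mathrm{Split}\left[ [\hat{m}^{unif}_{p,q}]_i^T \cdot \hat{\tilde{h}}_i \right]$, for all frequency levels $i$.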



Proof. We will use $\pi_p \in S_p$ and $\pi_q \in S_q$ to denote relative
rankings of $A$ and $B$ respectively. Let us consider estimating
$f^{MLE}(\pi_p)$. If $\tilde{h}$ is the empirical distribution of the
training examples, then $f^{MLE}(\pi_p)$ can be computed by summing over
examples in which the relative ranking of $A$ is consistent with $\pi_p$
(Equation 5.2), or equivalently, by marginalizing $\tilde{h}$ over the
interleavings and the relative rankings of $B$. Thus, we have:

(5.6)   $f^{MLE}(\pi_p) = \sum_{\pi_q \in S_q} \sum_{\tau \in \Omega_{A,B}} \tilde{h}(\tau \cdot (\pi_p, \pi_q))$,

where we have used Lemma 14 to decompose a ranking $\sigma$ into its
component relative rankings and interleaving.

The second step is to notice that the outer summation of Equation 5.6 is
exactly the type of marginalization that can already be done in the Fourier
domain via the Split algorithm of Huang et al. [2009], and thus, $f^{MLE}$
can be rewritten as $f^{MLE} = \mathrm{Split}(h')$, where the function
$h' : S_n \to \mathbb{R}$ is defined as
$h'(\sigma) = \sum_{\tau \in \Omega_{A,B}} \tilde{h}(\tau\sigma)$. Hence, if
we could compute the Fourier transform of the function $h'$, then we could
apply the ordinary Split algorithm to recover the Fourier transform of
$f^{MLE}$.

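In the third step, we observe that the function $h'$ can be written as a
convolution of the dual shuffle with $\tilde{h}$, thus establishing the first
part of the theorem:

$h'(\sigma) = \sum_{\tau \in \Omega_{A,B}} \tilde{h}(\tau\sigma) \propto \sum_{\pi \in S_n} m^{unif}_{p,q}(\pi)\, \tilde{h}(\pi\sigma) = \sum_{\pi \in S_n} m^{unif*}_{p,q}(\pi^{-1})\, \tilde{h}(\pi\sigma) = [m^{unif*}_{p,q} * \tilde{h}](\sigma)$.

Next, we use a standard fact about Fourier transforms [Diaconis, 1988]:
given a function $m^* : S_n \to \mathbb{R}$ defined as
$m^*(\sigma) = m(\sigma^{-1})$, the Fourier coefficient matrices of $m^*$ are
related to those of $m$ by the transpose. Hence,
$[\hat{m}^{unif*}_{p,q}]_i = [\hat{m}^{unif}_{p,q}]_i^T$, for every frequency
level $i$. Applying the convolution theorem to the Fourier coefficients of
the dual shuffle and the empirical distribution establishes the final part
of the theorem. □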

Notice that to compute the MLE relative ranking factors in the Fourier
domain, it is not necessary to know the interleaving distribution. It is
necessary, however, to compute the Fourier coefficients of the uniform
interleaving distribution ($\hat{m}^{unif}_{p,q}$), which we discuss in
Appendix C. It is also necessary to normalize the output of Split to sum to
one, but fortunately, normalizing a function $h$ can be performed in the
Fourier domain simply by dividing each Fourier coefficient matrix by the
coefficient at the trivial representation, which equals
$\sum_\sigma h(\sigma)$ (Proposition 20). See Algorithm 3 for pseudocode.
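
For intuition, the quantity that RiffleSplit recovers can also be computed
directly in the primal domain when the training set is small. The proof of
Theorem 21 notes that the maximum likelihood estimate sums over the examples
whose relative ranking of $A$ is consistent with a given $\pi_p$. The sketch
below does exactly that by counting. It is an illustration only, not the
Fourier-domain algorithm; the function names and the encoding of a ranking as
a tuple with sigma[i] equal to the rank of item i are assumptions made for
this example.

```python
from collections import Counter

def relative_ranking(sigma, items):
    """Order `items` by their rank under sigma (sigma[i] = rank of item i)."""
    return tuple(sorted(items, key=lambda i: sigma[i]))

def mle_relative_ranking_dist(samples, items):
    """Empirical distribution over relative rankings of `items`."""
    counts = Counter(relative_ranking(s, items) for s in samples)
    total = sum(counts.values())
    return {pi: c / total for pi, c in counts.items()}

# Four hypothetical training rankings of items 0..3 (rank 0 = most preferred).
samples = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (0, 2, 1, 3)]
f_A = mle_relative_ranking_dist(samples, items=(0, 1))
```

Here item 0 is ranked ahead of item 1 in three of the four rankings, so the
estimate places mass 0.75 on the relative ranking (0, 1).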

5.4. Marginal preservation guarantees. Performing our Fourier domain 
algorithms with a complete set of Fourier coefficients is just as intractable as 
performing the computations naively. Typically, in the Fourier setting, one
hopes instead to work with a set of low-order terms. For example, in the 
case of RiffleJoin, we might only receive the second order marginals of the 
parameter distributions as input. A natural question to ask then, is what is 
the approximation quality of the output given a bandlimited input? We now 
state a result below, which shows how our algorithms perform when called 
with a truncated set of Fourier coefficients. 

Theorem 22. Given enough Fourier terms to reconstruct the $k^{th}$-order
marginals of $f$ and $g$, RiffleJoin returns enough Fourier terms to exactly
reconstruct the $k^{th}$-order marginals of $h$. Likewise, given enough
Fourier terms to reconstruct the $k^{th}$-order marginals of $h$, RiffleSplit
returns enough Fourier terms to exactly reconstruct the $k^{th}$-order
marginals of both $f$ and $g$.

Proof. This result is a simple consequence of the well-known convolution
theorem (Proposition 20) and Theorems 9 and 12 from Huang et al. [2009].
Theorem 9 from Huang et al. [2009] states that, given $s^{th}$-order
marginals of factors $f$ and $g$, the Join algorithm can reconstruct the
$s^{th}$-order marginals of the joint distribution $f \cdot g$ exactly. Since
the riffle independent joint distribution is $m_{p,q} * (f \cdot g)$ and
convolution operations are pointwise in the Fourier domain (Proposition 20),
then given enough Fourier terms to reconstruct the $s^{th}$-order marginals
of the function $m_{p,q}$, we can also reconstruct the $s^{th}$-order
marginals of the riffle independent joint from the output of RiffleJoin. □

5.5. Running time. If the Fourier coefficient matrix for frequency level
$i$ of a joint distribution is $d \times d$, then the running time complexity
of the Join/Split algorithms of Huang et al. [2009] is, at worst, cubic in
the dimension, $O(d^3)$. If the interleaving Fourier coefficients are
precomputed ahead of time, then the complexity of RiffleJoin/RiffleSplit is
also $O(d^3)$.

If not, then we must Fourier transform the interleaving distribution. For
RiffleJoin, we can Fourier transform the empirical distribution directly from
the definition, or use the algorithms presented in Appendix C in the case
of biased riffle shuffles, which have $O(n^2 d^3)$ running time in the worst
case when $p \sim O(n)$. For RiffleSplit, one must compute the Fourier
transform of the uniform interleaving distribution, which, as we have shown
in Section 3.3, also takes the form of a biased riffle shuffle and therefore
also can be computed in $O(n^2 d^3)$ time. In Section 11, we plot
experimental running times.

6. Hierarchical riffle independent decompositions. Thus far throughout
the paper, we have focused exclusively on understanding riffled independent
models with a single binary partitioning of the full item set. In this
section we explore a natural model simplification which comes from the
simple observation that, since the relative ranking distributions $f_A$ and
$g_B$ are again distributions over rankings, the sets $A$ and $B$ can further
be decomposed into riffle independent subsets. We call such models
hierarchical riffle independent decompositions. Continuing with our running
example, one can imagine that the fruits are further partitioned into two
sets, a set consisting of citrus fruits ((L) Lemons and (O) Oranges) and a
set consisting of Mediterranean fruits ((F) Figs and (G) Grapes). To generate
a full ranking, one first draws rankings of the citrus and Mediterranean
fruits independently (⟦L, O⟧ and ⟦G, F⟧, for example). Secondly, the two sets
are interleaved to form a ranking of all fruits (⟦G, L, O, F⟧). Finally, a
ranking of the vegetables is drawn (⟦P, C⟧) and interleaved with the fruit
rankings to form a full joint ranking: ⟦P, G, L, O, F, C⟧. Notationally, we
can express the hierarchical decomposition as
$\{P, C\} \perp_{m_1} (\{L, O\} \perp_{m_2} \{F, G\})$. We can also visualize
hierarchies using trees (see Figure 7(a) for our example). The subsets of
items which appear as leaves in the tree will be referred to as leaf sets.
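
The generative procedure just described can be sketched in a few lines. The
code below is a minimal illustration under simplifying assumptions that are
not part of the model itself: every leaf-set ranking and every interleaving
is drawn uniformly at random, whereas the model allows arbitrary relative
ranking and interleaving distributions. The nested tuple/list encoding of the
hierarchy and the function names are hypothetical.

```python
import random

def interleave(left, right, rng):
    """Uniformly interleave two rankings, preserving each one's internal order."""
    n = len(left) + len(right)
    # Choose which positions of the merged ranking are filled from `left`.
    left_slots = set(rng.sample(range(n), len(left)))
    li, ri = iter(left), iter(right)
    return [next(li) if pos in left_slots else next(ri) for pos in range(n)]

def sample_hierarchy(tree, rng):
    """`tree` is a list of items (a leaf set) or a pair (left_tree, right_tree)."""
    if isinstance(tree, list):
        ranking = tree[:]
        rng.shuffle(ranking)  # uniform relative ranking of the leaf set
        return ranking
    left, right = tree
    return interleave(sample_hierarchy(left, rng),
                      sample_hierarchy(right, rng), rng)

rng = random.Random(0)
# The hierarchy of Figure 7(a): vegetables vs. (citrus vs. Mediterranean).
tree = (["P", "C"], (["L", "O"], ["F", "G"]))
ranking = sample_hierarchy(tree, rng)
```

Because interleaving never reorders items within a set, the relative ranking
drawn for each leaf set survives unchanged in the final joint ranking.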

A natural question to ask is: if we used a different hierarchy with the 
same leaf sets, would we capture the same distributions? For example, does 
a distribution which decomposes according to the tree in Figure 7(b) also 
decompose according to the tree in Figure 7(a)? The answer, in general, is 
no, due to the fact that distinct hierarchies impose different sets of indepen- 
dence assumptions, and as a result, different structures can be well or badly 
suited for modeling a given dataset. Consequently, it is important to use the 
"correct" structure if possible. 

6.1. Shared independence structure. It is interesting to note, however, 
that while the two structures in Figures 7(a) and 7(b) encode distinct families 
of distributions, it is possible to identify a set of independence assumptions 
common to both structures. In particular since both structures have the same 
leaf sets, any distributions consistent with either of the two hierarchies must 
also be consistent with what we call a 3-way decomposition. We define a d- 
way decomposition to be a distribution with a single level of hierarchy, but 
instead of partitioning the entire item set into just two subsets, one partitions 
into d subsets, then interleaves the relative rankings of each of the d subsets 
together to form a joint ranking of items. Any distribution consistent with 
either Figure 7(b) or 7(a) must consequently also be consistent with the 
structure of Figure 7(c). More generally, we have: 

Proposition 23. If h is a hierarchical riffle independent model with d
leaf sets, then h can also be written as a d-way decomposition.

[Figure 7: tree diagrams omitted. (a) Example of hierarchical riffled
independence structure on $S_6$: {C,P,L,O,F,G} splits into {C,P}
(Vegetables) and {L,O,F,G} (Fruits), and Fruits splits into {L,O} (Citrus)
and {F,G} (Mediterranean). (b) Another example, not equivalent to (a).
(c) 3-way decomposition for $S_6$ with leaf sets {C,P}, {L,O}, {F,G}
(generalizes the class of distributions parameterized by (a), (b)).
(d) Hierarchical decomposition into singleton subsets, where each leaf set
consists of a single item (we will also refer to this particular type of
tree as a 1-thin chain).]

Fig 7. Examples of distinct hierarchical riffle independent structures.



Proof. We proceed by induction. Suppose the result holds for $S_{n'}$ for all
$n' < n$. We want to establish that the result also holds for $S_n$. If $h$
factors according to a hierarchical riffle independent model, then it can be
written as $h = m \cdot f_A \cdot g_B$, where $m$ is the interleaving
distribution, and $f_A$, $g_B$ themselves factor as hierarchical riffle
independent distributions with, say, $d_1$ and $d_2$ leaf sets, respectively
(where $d_1 + d_2 = d$). By the hypothesis, since $|A|, |B| < n$, we can
factor both $f_A$ and $g_B$ as $d_1$-way and $d_2$-way decompositions
respectively. We can therefore write $f_A$ and $g_B$ as:

$f_A(\pi_A) = m_A(\tau_{A_1,\dots,A_{d_1}}) \prod_{i=1}^{d_1} f_{A_i}(\phi_{A_i}(\pi_A))$,
$g_B(\pi_B) = m_B(\tau_{B_1,\dots,B_{d_2}}) \prod_{i=1}^{d_2} g_{B_i}(\phi_{B_i}(\pi_B))$.

Substituting these decompositions into the factorization of the distribution
$h$, we have:

$h(\sigma) = m(\tau_{A,B}(\sigma))\, f_A(\phi_A(\sigma))\, g_B(\phi_B(\sigma))$,
$\quad = \left( m(\tau_{A,B}(\sigma))\, m_A(\tau_{A_1,\dots,A_{d_1}})\, m_B(\tau_{B_1,\dots,B_{d_2}}) \right) \prod_{i=1}^{d_1} f_{A_i}(\phi_{A_i}(\phi_A(\sigma))) \prod_{i=1}^{d_2} g_{B_i}(\phi_{B_i}(\phi_B(\sigma)))$,
$\quad = \tilde{m}(\tau_{A_1,\dots,A_{d_1},B_1,\dots,B_{d_2}}) \prod_{i=1}^{d_1} f_{A_i}(\phi_{A_i}(\sigma)) \prod_{i=1}^{d_2} g_{B_i}(\phi_{B_i}(\sigma))$,

where $\tilde{m}$ denotes the product of the three interleaving terms, and
where the last line follows because any legitimate interleaving of the sets
$A$ and $B$ is also a legitimate interleaving of the sets
$A_1, \dots, A_{d_1}, B_1, \dots, B_{d_2}$, and since
$\phi_{A_i}(\phi_A(\sigma)) = \phi_{A_i}(\sigma)$. This shows that the
distribution $h$ factors as a $(d_1 + d_2)$-way decomposition, and concludes
the proof. □

In general, knowing the hierarchical decomposition of a model is more
desirable than knowing its d-way decomposition, which may require many more
parameters ($O\left(n! / \prod_i |A_i|!\right)$, where $i$ indexes over leaf
sets $A_i$). For example, the n-way decomposition requires $O(n!)$ parameters
and captures every distribution over permutations.
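
To make the comparison concrete, the number of distinct interleavings of
leaf sets of given sizes is a multinomial coefficient, which the short sketch
below evaluates. The helper name is hypothetical, and the count covers only
the support of the interleaving distribution, not the parameters of the
relative ranking distributions.

```python
from math import factorial

def num_interleavings(leaf_sizes):
    """Number of ways to interleave ordered lists of the given sizes."""
    count = factorial(sum(leaf_sizes))
    for size in leaf_sizes:
        count //= factorial(size)  # each partial quotient is an integer
    return count

three_way = num_interleavings([2, 2, 2])  # {C,P}, {L,O}, {F,G} in one level
# Figure 7(a): one table for vegetables vs. fruits, one for citrus vs. Mediterranean.
hierarchy = num_interleavings([2, 4]) + num_interleavings([2, 2])
n_way = num_interleavings([1] * 6)        # every item is its own leaf set
```

For the six-item example this gives 90 interleavings for the 3-way
decomposition, 15 + 6 = 21 table entries for the hierarchy of Figure 7(a),
and 6! = 720 for the 6-way decomposition.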

6.2. Thin chain models. There is a class of particularly simple hierarchical
models which we will refer to as $k$-thin chain models. By a $k$-thin chain
model, we refer to a hierarchical structure in which the size of the smaller
set at each split in the hierarchy is fixed to be a constant $k$ and can
therefore be expressed as:

$(A_1 \perp_m (A_2 \perp_m (A_3 \perp_m \cdots)))$, $|A_i| = k$, for all $i$.

See Figure 7(d) for an example of a 1-thin chain. We view thin chains as being
somewhat analogous to thin junction tree models [Bach and Jordan, 2001],
in which cliques are never allowed to have more than $k$ variables. When
$k \sim O(1)$, for example, the number of model parameters scales polynomially
in $n$. To draw rankings from a thin chain model, one sequentially inserts
items independently, one group of size $k$ at a time, into the full ranking.
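
The sequential insertion procedure is easy to sketch for the 1-thin case. The
code below assumes, purely for illustration, that each item is inserted at a
uniformly random position; a fitted thin chain would instead use a learned
interleaving distribution at every step. The function name is hypothetical.

```python
import random

def sample_one_thin_chain(items, rng):
    """Build a ranking by inserting one item at a time.

    `items` lists the chain from its innermost leaf outward, so the first
    item starts the ranking and each later item is interleaved with
    everything inserted before it.
    """
    ranking = []
    for item in items:
        pos = rng.randrange(len(ranking) + 1)  # uniform insertion slot (assumed)
        ranking.insert(pos, item)
    return ranking

rng = random.Random(0)
ranking = sample_one_thin_chain(["C", "P", "L", "O", "F", "G"], rng)
```

The j-th inserted item has j possible positions, so the interleaving tables
of a 1-thin chain on n items have n(n+1)/2 entries in total, in line with the
polynomial scaling noted above.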

Theorem 24. The $k^{th}$ order marginals are sufficient statistics for a
$k$-thin chain model.

Proof. Corollary of Theorem 22. □

Example 25 (APA election data (continued)). The APA, as described
by Diaconis [1989], is divided into "academicians and clinicians who are on
uneasy terms". In 1980, candidates {1,3} (W. Bevan and C. Kiesler, who
were research psychologists) and {4,5} (M. Siegle and L. Wright, who were
clinical psychologists) fell on opposite ends of this political spectrum, with
candidate 2 (I. Iscoe) being somewhat independent. Diaconis conjectured that
voters choose one group over the other, and then choose within. We are now
able to verify Diaconis' conjecture using our riffled independence framework.
After removing candidate 2 from the distribution, we perform a search within
candidates {1,3,4,5} to again find nearly riffle independent subsets. We find
that $A = \{1,3\}$ and $B = \{4,5\}$ are very nearly riffle independent (with
respect to KL divergence) and thus are able to verify that candidate sets {2},
{1,3}, {4,5} are indeed grouped in a riffle independent sense in the APA
data. We remark that in a later work, Marden [1995] identified candidate 2
(I. Iscoe) as belonging to yet a third group of psychologists called community
psychologists. The hierarchical structure that best describes the APA data is
shown in Figure 8 and the KL-divergence from the true distribution to the
hierarchical model is $D_{KL} = .0676$.

[Figure 8: tree diagram omitted. The root splits into {WB, CK, MS, LW} and
candidate 2 (Community); the former splits into candidates 1,3 (Research)
and candidates 4,5 (Clinical).]

Fig 8. Hierarchical structure learned from APA data.

Finally, for the two main opposing groups within the APA, the riffle
shuffling distribution for sets {1,3} and {4,5} is not well approximated by a
biased riffle shuffle. Instead, since there are two coalitions, we fit a
mixture of two biased riffle shuffles to the data and found the bias
parameters of the mixture components to be $\alpha_1 \approx .67$ and
$\alpha_2 \approx .17$, indicating that the two components oppose each other
(since $\alpha_1$ and $\alpha_2$ lie on either side of .5).

7. Structure discovery I: objective functions. Since different hi- 
erarchies impose different independence assumptions, we would like to find 
the structure that is best suited for modeling a given ranking dataset. On 
some datasets, a natural hierarchy might be available — for example, if one 
were familiar with the typical politics of APA elections, then it may have 
been possible to "guess" the optimal hierarchy. However, for general ranked 
data, it is not always obvious what kind of groupings riffled independence 
will lead to, particularly for large n. Should fruits really be riffle independent 
of vegetables? Or are green foods riffle independent of red foods? 

Over the next three sections, we address the problem of automatically
discovering hierarchical riffle independent structures from training data.
Key among our observations is the fact that while item ranks cannot be
independent due to mutual exclusivity, relative ranks between sets of items
are not subject to the same constraints. More than simply being a
'clustering' algorithm, however, our procedure can be thought of as a
structure learning algorithm, like those from the graphical models literature
[Koller and Friedman, 2009], which find the optimal (riffled) independence
decomposition of a distribution.

The base problem that we address in this current section is how to find
the best structure if there is only one level of partitioning and two leaf
sets, $A$ and $B$. Alternatively, we want to find the topmost partitioning of
the tree. In Section 8, we use this base case as part of a top-down approach
for learning a full hierarchy.

7.1. Problem statement. Given a training set of rankings drawn i.i.d. from
a distribution $h$ in which a subset of items, $A \subset \{1, \dots, n\}$,
is riffle independent of its complement, $B$, the problem which we address in
this section is that of automatically determining the sets $A$ and $B$. If
$h$ does not exactly factor riffle independently, then we would like to find
the riffle independent approximation which is closest to $h$ in some sense.
Formally, we would like to solve the problem:

(7.1)   $\arg\min_A \min_{m,f,g} D_{KL}\left( \tilde{h}(\sigma) \,\|\, m(\tau_{A,B}(\sigma))\, f(\phi_A(\sigma))\, g(\phi_B(\sigma)) \right)$,

where $\tilde{h}$ is the empirical distribution of training examples and
$D_{KL}$ is the Kullback-Leibler divergence measure. Equation 7.1 is a
seemingly reasonable objective since it can also be interpreted as maximizing
the likelihood of the training data. In the limit of infinite data,
Equation 7.1 can be shown via the Gibbs inequality to attain its minimum,
zero, at the subsets $A$ and $B$ if and only if the sets $A$ and $B$ are truly
riffle independent of each other.
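
The likelihood interpretation follows from expanding the divergence. As a
short worked step, write $\tilde{h}$ for the empirical distribution of the
training rankings, $p$ for the candidate riffle independent model, and
$\sigma^{(1)}, \dots, \sigma^{(N)}$ for the training rankings (the symbols
$p$ and $N$ are introduced here only for this illustration):

```latex
\[
D_{KL}(\tilde{h} \,\|\, p)
  = \sum_{\sigma} \tilde{h}(\sigma) \log \tilde{h}(\sigma)
    - \sum_{\sigma} \tilde{h}(\sigma) \log p(\sigma)
  = -H(\tilde{h}) - \frac{1}{N} \sum_{t=1}^{N} \log p(\sigma^{(t)}).
\]
```

The entropy term $H(\tilde{h})$ does not depend on $A$, $m$, $f$ or $g$, so
minimizing the divergence over the model is the same as maximizing the
average log-likelihood of the training data.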

For small problems, one can actually solve Problem 7.1 using a single
computer by evaluating the approximation quality of each subset $A$ and
taking the minimum, which was the approach taken in Example 25. However,
for larger problems, one runs into time and sample complexity problems since
optimizing the globally defined objective function (Equation 7.1) requires
relearning all model parameters ($m$, $f_A$, and $g_B$) for each of the
exponentially many subsets of $\{1, \dots, n\}$. In fact, for large sets $A$
and $B$, it is rare that one would have enough samples to estimate the
relative ranking parameters $f_A$ and $g_B$ without already having discovered
the hierarchical riffle independent decompositions of $A$ and $B$. We next
propose a more locally defined objective function, reminiscent of clustering,
which we will use instead of Equation 7.1. As we show, our new objective will
be more tractable to compute and have lower sample complexity for estimation.

7.2. Proposed objective function. The approach we take is to minimize a
different measure that exploits the observation that absolute ranks of items
in $A$ are fully independent of relative ranks of items in $B$, and vice versa
(which we prove in Proposition 26). With our vegetables and fruits, for
example, knowing that Figs is ranked first among all six items (the absolute
rank of a fruit) should give no information about whether Corn is preferred
to Peas (the relative rank of vegetables). More formally, given a subset
$A = \{a_1, \dots, a_k\}$, recall that $\sigma(A)$ denotes the vector of
(absolute) ranks assigned to items in $A$ by $\sigma$ (thus,
$\sigma(A) = (\sigma(a_1), \sigma(a_2), \dots, \sigma(a_k))$). We propose to
minimize an alternative objective function:

(7.2)   $\mathcal{F}(A) \equiv I(\sigma(A) ; \phi_B(\sigma)) + I(\sigma(B) ; \phi_A(\sigma))$,

where $I$ denotes the mutual information (defined between two variables $X_1$
and $X_2$ by $I(X_1; X_2) \equiv D_{KL}(P(X_1, X_2) \,\|\, P(X_1)P(X_2))$).

The function $\mathcal{F}$ does not have the same likelihood interpretation
as the objective function of Equation 7.1. However, it can be thought of as a
composite likelihood of two models, one in which the relative rankings of $A$
are independent of absolute rankings of $B$, and one in which the relative
rankings of $B$ are independent of absolute rankings of $A$ (see
Appendix B.2). With respect to distributions which satisfy (or approximately
satisfy) both models (i.e., the riffle independent distributions), minimizing
$\mathcal{F}$ is equivalent to (or approximately equivalent to) maximizing the
log likelihood of the data. Furthermore, we can show that $\mathcal{F}$ is
guaranteed to detect riffled independence:

Proposition 26. $\mathcal{F}(A) = 0$ is a necessary and sufficient criterion
for a subset $A \subset \{1, \dots, n\}$ to be riffle independent of its
complement, $B$.

Proof. Suppose $A$ and $B$ are riffle independent. We first claim that
$\sigma(A)$ and $\phi_B(\sigma)$ are independent. To see this, observe that
the absolute ranks of $A$, $\sigma(A)$, are determined by the relative
rankings of $A$, $\phi_A(\sigma)$, and the interleaving $\tau_{A,B}(\sigma)$.
By the assumption that $A$ and $B$ are riffle independent, we know that the
relative rankings of $A$ and $B$ ($\phi_A(\sigma)$ and $\phi_B(\sigma)$), and
the interleaving $\tau_{A,B}(\sigma)$ are independent, establishing the claim.
The argument that $\sigma(B)$ and $\phi_A(\sigma)$ are independent is similar,
thus establishing one direction of the proposition.

To establish the reverse direction, assume that Equation 7.2 evaluates to
zero on sets $A$ and $B$. It follows that $\sigma(A) \perp \phi_B(\sigma)$ and
$\phi_A(\sigma) \perp \sigma(B)$. Now, as a converse to the observation from
above, note that the absolute ranks of $A$ determine the relative ranks of
$A$, $\phi_A(\sigma)$, as well as the interleaving $\tau_{A,B}(\sigma)$.
Similarly, $\sigma(B)$ determines $\phi_B(\sigma)$ and $\tau_{A,B}(\sigma)$.
Thus, $(\phi_A(\sigma), \tau_{A,B}(\sigma)) \perp \phi_B(\sigma)$ and
$\phi_A(\sigma) \perp (\tau_{A,B}(\sigma), \phi_B(\sigma))$. It then follows
that $\phi_A(\sigma) \perp \tau_{A,B}(\sigma) \perp \phi_B(\sigma)$. □

As with Equation 7.1, optimizing $\mathcal{F}$ is still intractable for
large $n$. However, $\mathcal{F}$ motivates a natural proxy, in which we
replace the mutual informations defined over all $n$ variables by a sum of
mutual informations defined over just three variables at a time.

Definition 27 (Tripletwise mutual informations). Given any triplet of
distinct items, $(i, j, k)$, we define the tripletwise mutual information
term, $I_{i;j,k} \equiv I(\sigma(i) ; \sigma(j) < \sigma(k))$.

The tripletwise mutual information $I_{i;j,k}$ can be computed as follows:

$I_{i;j,k} = \sum_{\sigma(i)} \sum_{\sigma(j)<\sigma(k)} P(\sigma(i), \sigma(j)<\sigma(k)) \log \frac{P(\sigma(i), \sigma(j)<\sigma(k))}{P(\sigma(i))\, P(\sigma(j)<\sigma(k))}$,

where the inside summation runs over two values, true/false, for the binary
variable $\sigma(j) < \sigma(k)$. To evaluate how riffle independent two
subsets $A$ and $B$ are, we want to examine the triplets that straddle the
two sets.
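
The tripletwise mutual information of Definition 27 can be estimated from a
set of rankings by plugging empirical frequencies into the definition. The
sketch below does this with a plain plug-in estimator, which is a
simplification chosen for brevity and not the regularized estimator adopted
later in the paper. The function name and the encoding of a ranking as a
tuple with s[i] equal to the rank of item i are assumptions.

```python
from collections import Counter
from itertools import permutations
from math import log

def triplet_mi(samples, i, j, k):
    """Plug-in estimate (in nats) of I(sigma(i) ; sigma(j) < sigma(k))."""
    n = len(samples)
    joint = Counter((s[i], s[j] < s[k]) for s in samples)
    rank_marginal = Counter(s[i] for s in samples)
    pref_marginal = Counter(s[j] < s[k] for s in samples)
    mi = 0.0
    for (rank, pref), count in joint.items():
        # P(joint) / (P(rank) * P(pref)) written with raw counts.
        ratio = count * n / (rank_marginal[rank] * pref_marginal[pref])
        mi += (count / n) * log(ratio)
    return mi

# Under the uniform distribution on S_3 the rank of item 0 says nothing
# about whether item 1 beats item 2.
uniform = list(permutations(range(3)))
mi_uniform = triplet_mi(uniform, 0, 1, 2)

# Here the rank of item 0 fully determines the comparison of items 1 and 2.
dependent = [(0, 1, 2), (2, 1, 0)]
mi_dependent = triplet_mi(dependent, 0, 1, 2)
```

The first estimate is zero and the second equals log 2, the entropy of a fair
binary variable, which is the largest value this quantity can take.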

Definition 28 (Internal and Cross triplets). We define
$\Omega^{cross}_{A,B}$ to be the set of triplets which "cross" from set $A$ to
set $B$: $\Omega^{cross}_{A,B} \equiv \{(i; j, k) : i \in A,\ j, k \in B\}$.
$\Omega^{cross}_{B,A}$ is similarly defined. We also define $\Omega^{int}_A$
to be the set of triplets that are internal to $A$:
$\Omega^{int}_A \equiv \{(i; j, k) : i, j, k \in A\}$, and again,
$\Omega^{int}_B$ is similarly defined.

Our proxy objective function can be written as the sum of the mutual
information evaluated over all of the crossing triplets:

(7.3)   $\tilde{\mathcal{F}}(A) \equiv \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} I_{i;j,k} + \sum_{(i,j,k) \in \Omega^{cross}_{B,A}} I_{i;j,k}$.

$\tilde{\mathcal{F}}$ can be viewed as a low order version of $\mathcal{F}$,
involving mutual information computations over triplets of variables at a
time instead of n-tuples. The mutual information $I_{i;j,k}$, for example,
reflects how much the rank of a vegetable ($i$) tells us about how two fruits
($j$, $k$) compare. If $A$ and $B$ are riffle independent, then we know that
$I_{i;j,k} = 0$ for any $(i, j, k)$ such that $i \in A$, $j, k \in B$ (and
similarly for any $(i, j, k)$ such that $i \in B$, $j, k \in A$). Given that
fruits and vegetables are riffle independent sets, knowing that Grapes is
preferred to Figs should give no information about the absolute rank of Corn,
and therefore $I_{Corn;Grapes,Figs}$ should be zero. Note that such
tripletwise independence assertions bear resemblance to assumptions sometimes
made in social choice theory, commonly referred to as Independence of
Irrelevant Alternatives [Arrow, 1963], where the addition of a third element
$i$ is assumed to not affect whether one prefers an element $j$ over $k$.

Fig 9. Examples: (a) shows a graphical depiction of the problem of finding
riffle independent subsets. A triangle with vertices $(i, j, k)$ represents
the term $I_{i;j,k}$. Since the $I_{i;j,k}$ are not invariant with respect to
a permutation of the indices $i$, $j$, and $k$, the triangles are directed,
and we therefore use double bars to represent the nodes $j$, $k$ for the term
$I_{i;j,k}$. Note that if the tripletwise terms were instead replaced by
edgewise terms, the problem would simply be a standard clustering problem;
(b) shows the matrix of tripletwise mutual informations computed from the APA
dataset (see Example 29).

The objective $\tilde{\mathcal{F}}$ is somewhat reminiscent of typical
graphcut and clustering objectives. Instead of partitioning a set of nodes
based on sums of pairwise similarities, we partition based on sums of
tripletwise affinities. We show a graphical depiction of the problem in
Figure 9(a), where cross triplets (in $\Omega^{cross}_{A,B}$,
$\Omega^{cross}_{B,A}$) have low weight and internal triplets (in
$\Omega^{int}_A$, $\Omega^{int}_B$) have high weight. The objective is to find
a partition such that the sum over cross triplets is low. In fact, the
problem of optimizing $\tilde{\mathcal{F}}$ can be seen as an instance of the
weighted, directed hypergraph cut problem [Gallo et al., 1993]. Note that the
word directed is significant for us, because, unlike typical clustering
problems, our triplets are not symmetric (for example,
$I_{i;j,k} \neq I_{j;i,k}$), resulting in a nonstandard and poorly understood
optimization problem.
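
As a small illustration of the cut-like structure, the sketch below sums a
tripletwise score over all triplets that cross a candidate partition. It is a
sketch under stated assumptions: the score is supplied as a callable
mi(i, j, k) (a hypothetical interface), and each unordered pair {j, k} is
counted once, since the comparison of j with k carries the same information
as the comparison of k with j.

```python
from itertools import combinations

def cross_triplet_objective(A, B, mi):
    """Sum mi(i, j, k) over triplets with i in one set and j, k in the other."""
    total = 0.0
    for source, target in ((A, B), (B, A)):
        for i in source:
            for j, k in combinations(sorted(target), 2):
                total += mi(i, j, k)
    return total

# With a constant score of 1 the objective simply counts crossing triplets:
# |A| * C(|B|, 2) + |B| * C(|A|, 2) = 2 * 3 + 3 * 1 = 9.
count = cross_triplet_objective({0, 1}, {2, 3, 4}, lambda i, j, k: 1.0)
```

The count also shows why unbalanced partitions tend to score lower: a
singleton set paired with four other items yields only 1 * 6 + 4 * 0 = 6
crossing triplets.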

Example 29 (APA election data (continued)). Figure 9(b) visualizes the
tripletwise mutual informations computed from the APA dataset. Since there
are five candidates, there are $\binom{5}{2} = 10$ pairs of candidates. The
$(i, (j, k))$ entry in the matrix corresponds to
$I(\sigma(i); \sigma(j) < \sigma(k))$. For easier visualization, we have set
entries of the form $(i, (i, k))$ and $(i, (j, i))$ to be zero since they
are not counted in the objective function.

The highlighted row corresponds to candidate 2, in which all of the mutual
information terms are close to zero. We see that the tripletwise mutual
information terms tell a story consistent with the conclusion of Example 17,
in which we showed that candidate 2 was approximately riffle independent of
the remaining candidates.

Finally, it is also interesting to examine the (3,(1,4)) entry. It is the
largest mutual information in the matrix, a fact which should not be
surprising since candidates 1 and 3 are politically aligned (both research
psychologists). Thus, knowing, for example, that candidate 3 was ranked first
is a strong indication that candidate 1 was preferred over candidate 4.

7.3. Encouraging balanced partitions. In practice, like the minimum cut
objective for graphs, the tripletwise objective of Equation 7.3 has a tendency
to "prefer" small partitions (either $|A|$ or $|B|$ very small) to more
balanced partitions ($|A|, |B| \approx n/2$) due to the fact that unbalanced
partitions have fewer triplets that cross between $A$ and $B$. The simplest
way to avoid this bias is to optimize the objective function over subsets of
a fixed size $k$. As we discuss in the next section, optimizing with a fixed
$k$ can be useful for building thin hierarchical riffle independent models.
Alternatively, one can use a modified objective function that encourages more
balanced partitions. For example, we have found the following normalized cut
[Shi and Malik, 2000] inspired variation of our objective to be useful for
detecting riffled independence when the size $k$ is unknown:

(7.4)   $\tilde{\mathcal{F}}^{balanced}(A)$   [definition not recoverable from the source text]

Intuitively, the denominator in Equation 7.4 penalizes subsets whose interiors
have small weight. Note that there exist many variations on the objective
function that encourage balance, but $\tilde{\mathcal{F}}^{balanced}$ is what
we have used in our experiments.

7.4. Low-order detectability assumptions. When does $\tilde{\mathcal{F}}$
detect riffled independence? It is not difficult to see, for example, that
$\tilde{\mathcal{F}} = 0$ is a necessary condition for riffled independence,
since $A \perp_m B$ implies $I_{a;b,b'} = 0$. We have:

Proposition 30. If $A$ and $B$ are riffle independent sets, then
$\tilde{\mathcal{F}}(A) = 0$.

However, the converse of Proposition 30 is not true in full generality
without accounting for dependencies that involve larger subsets of variables.
Just as the pairwise independence assumptions that are commonly used for
randomized algorithms [Motwani and Raghavan, 1996]* do not imply full
independence between two sets of variables, there exist distributions which
"look" riffle independent from tripletwise marginals but do not factor upon
examining higher-order terms. Nonetheless, in most practical scenarios, we
expect $\tilde{\mathcal{F}} = 0$ to imply riffled independence.

7.5. Quadrupletwise objective functions for riffled independence. A natural
variation of our method is to base the objective function on the following
quantities, defined over quadruplets of items instead of triplets:

(7.5)   $I_{ij;k\ell} \equiv I(\sigma(i) < \sigma(j) ; \sigma(k) < \sigma(\ell))$.

Intuitively, $I_{ij;k\ell}$ measures how much knowing that, say, Peas is
preferred to Corn, tells us about whether Grapes are preferred to Oranges.
Again, if the fruits and vegetables are riffle independent, then the mutual
information should be zero. Summing over terms which cross between the cut,
we obtain a quadrupletwise objective function defined as:
$\tilde{\mathcal{F}}^{quad}(A) \equiv \sum_{i,j \in A,\ k,\ell \in B} I_{ij;k\ell}$.
If $A$ and $B$ are riffle independent with $i, j \in A$ and $k, \ell \in B$,
then the mutual information $I_{ij;k\ell}$ is zero. Unlike their tripletwise
counterparts, however, the $I_{ij;k\ell}$ do not arise from a global measure
that is both necessary and sufficient for detecting riffled independence. In
particular, $I(\phi_A(\sigma); \phi_B(\sigma)) = 0$ is insufficient to
guarantee riffled independence. For example, if the interleaving depends on
the relative rankings of $A$ and $B$, then riffled independence is not
satisfied, yet $\tilde{\mathcal{F}}^{quad}(A) = 0$. Moreover, it is not clear
how one would detect riffle independent subsets consisting of a single
element using a quadrupletwise measure. As such, we have focused on
tripletwise measures in our experiments. Nonetheless, quadrupletwise measures
may potentially be useful in practice (for detecting larger subsets) and have
the significant advantage that the $I_{ij;k\ell}$ can be estimated with fewer
samples and using almost any imaginable form of partially ranked data.

7.6. Estimating the objective from samples. We have so far argued that
$\tilde{\mathcal{F}}$ is a reasonable function for finding riffle independent
subsets. However, since we only have access to samples rather than the true
distribution $h$ itself, it will only be possible to compute an approximation
to the objective $\tilde{\mathcal{F}}$. In particular, for every triplet of
items, $(i, j, k)$, we must compute an estimate of the mutual information
$I_{i;j,k}$ from i.i.d. samples drawn from $h$, and the main question is: how
many samples will we need in order for the approximate version of
$\tilde{\mathcal{F}}$ to remain a reasonable objective function?

* A pairwise independent family of random variables is one in which any two
members are marginally independent. Subsets with more than two members may
not necessarily factor independently, however.

In the following, we denote the estimated value of $I_{i;j,k}$ by
$\hat{I}_{i;j,k}$. For each triplet, we use a regularized procedure due to
Höffgen [1993] to estimate mutual information. We adapt his sample complexity
bound to our problem below.

Lemma 31. For any fixed triplet $(i, j, k)$, the mutual information
$I_{i;j,k}$ can be estimated to within an accuracy of $\Delta$ with
probability at least $1 - \gamma$ using
$S(\Delta, \gamma) \equiv O(\cdots \log^2 \cdots \log \cdots)$ i.i.d. samples
and the same amount of time.

The approximate objective function is therefore:

$\hat{\mathcal{F}}(A) \equiv \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \hat{I}_{i;j,k} + \sum_{(i,j,k) \in \Omega^{cross}_{B,A}} \hat{I}_{i;j,k}$.

What we want to now show is that, if there exists a unique way to partition
$\{1, \dots, n\}$ into riffle independent sets, then given enough training
examples, our approximation $\hat{\mathcal{F}}$ uniquely singles out the
correct partition as its minimum with high probability. A class of riffle
independent distributions for which the uniqueness requirement is satisfied
consists of the distributions for which $A$ and $B$ are strongly connected
according to the following definition.

Definition 32. A subset $A \subset \{1, \dots, n\}$ is called
$\epsilon$-third-order strongly connected if, for every triplet
$i, j, k \in A$ with $i, j, k$ distinct, we have $I_{i;j,k} > \epsilon$.

If a set $A$ is riffle independent of $B$ and both sets are third order
strongly connected, then we can ensure that riffled independence is
detectable from third-order terms and that the partition is unique. We have
the following probabilistic guarantee.

Theorem 33. Let $A$ and $B$ be $\epsilon$-third order strongly connected
riffle independent sets, and suppose $|A| = k$. Given
$S(\Delta, \epsilon) \equiv O(\cdots \log^2 \cdots \log \cdots)$ i.i.d.
samples, the minimum of $\hat{\mathcal{F}}$ is achieved at exactly the subsets
$A$ and $B$ with probability at least $1 - \gamma$.

See the Appendix for details. Finally, we remark that the strong
connectivity assumptions used in Theorem 33 are stronger than necessary, and
with respect to certain interleaving distributions, it can even be the case
that the estimated objective function singles out the correct partition when
all of the internal triplets belonging to $A$ and $B$ have zero mutual
information. Moreover, in some cases, there are multiple valid partitionings
of the item set. For example, the uniform distribution is a distribution in
which every subset $A \subset \{1, \dots, n\}$ is riffle independent of its
complement. In such cases, multiple solutions are equally good when evaluated
under $\tilde{\mathcal{F}}$, but not under its sample approximation,
$\hat{\mathcal{F}}$.

8. Structure discovery II: algorithms. Having now designed a function
that is tractable to estimate from the perspectives of both computational
and sample complexity, we turn to the problem of learning the hierarchical
riffle independence structure of a distribution from training examples.
Instead of directly optimizing an objective in the space of possible hierarchies,
we take a simple top-down approach in which the item sets are recursively
partitioned by optimizing $\hat{\mathcal{F}}$ until some stopping criterion is
met (for example, when the leaf sets are smaller than some k, or simply
stopping after a fixed number of splits).

8.1. Exhaustive optimization. Optimizing the function $\hat{\mathcal{F}}$
requires searching through the collection of subsets of size $|A| = k$, which,
when performed exhaustively, requires $O\left(\binom{n}{k}\right)$ time. An
exhaustive approach thus runs in exponential time, for example, when
$k \in O(n)$.

However, when the size of k is known and small ($k \in O(1)$), the optimal
partitioning of an item set can be found in polynomial time by exhaustively
evaluating $\hat{\mathcal{F}}$ over all k-subsets.
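The exhaustive search over k-subsets can be sketched in a few lines. This is only an illustration under stated assumptions: `score` is a hypothetical callable standing in for an evaluation of the estimated objective at a candidate set, and items are labeled 0, ..., n-1.

```python
from itertools import combinations


def exhaustive_partition(n, k, score):
    """Return (A, B) where A is the k-subset of range(n) minimizing score(A).

    score: callable mapping a frozenset A to a number (hypothetical
           interface, e.g. an estimate of the objective at A).
    The loop visits all C(n, k) subsets, so it is practical only for small k.
    """
    best = min((frozenset(c) for c in combinations(range(n), k)), key=score)
    return best, frozenset(range(n)) - best
```

For instance, with a toy score that counts disagreements with a hidden set {0, 3}, the search over the ten 2-subsets of five items returns that set and its complement.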

Corollary 34. Under the conditions of Theorem 33, one needs at most
$S(\Delta, \epsilon) = O(\ldots \log^2 \ldots \log \ldots)$ samples to recover
the exact riffle independent partitioning with probability $1 - \gamma$.

When k is small, we can therefore use exhaustive optimization to learn 
the structure of k-thin chain models (Section 6.2) in polynomial time. The
structure learning problem for thin chains is to discover how the items are 
partitioned into groups, which group is inserted first, which group is inserted 
second, and so on. To learn the structure of a thin chain, we can use ex- 
haustive optimization to learn the topmost partitioning of the item set, then 
recursively learn a thin chain model for the items in the larger subset. 
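The recursion for thin chains might look as follows. This is a sketch under assumptions: `split` is a hypothetical callable returning the k items judged riffle independent of the rest (for example, by exhaustive optimization), and the recursion is assumed to stop once at most k items remain.

```python
def learn_thin_chain(items, k, split):
    """Peel off groups of size k until at most k items remain.

    items: iterable of item labels.
    split: callable (items, k) -> iterable of k items forming the smaller
           side of the topmost partition (hypothetical interface).
    Returns the groups in the order in which they were split off, followed
    by the remaining items.
    """
    items = frozenset(items)
    chain = []
    while len(items) > k:
        group = frozenset(split(items, k))
        chain.append(group)
        items = items - group  # recurse on the larger subset
    chain.append(items)
    return chain
```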

8.2. Handling arbitrary partitions using anchors. When k is large, or
even unknown, $\hat{\mathcal{F}}$ cannot be optimized using exhaustive methods.
Instead, we propose a simple algorithm for finding A and B based on the
following observation. If an oracle could identify any two elements of the set A,
say, $a_1, a_2$, in advance, then the quantity
$I_{x;a_1,a_2} = I(x; a_1 < a_2)$ indicates
AnchorsPartition

input : training set $\{\sigma^{(1)}, \dots, \sigma^{(m)}\}$, $k = |A|$
output: riffle independent partitioning of item set, $(A_{best}, B_{best})$

Fix $a_1$ to be any item;
forall $a_2 \in \{1, \dots, n\}$, $a_1 \neq a_2$ do
    Estimate $\hat{I}_{x;a_1,a_2}$ for all $x \neq a_1, a_2$;
    $\hat{I}_k \leftarrow$ $k$-th smallest item in $\{\hat{I}_{x;a_1,a_2} : x \neq a_1, a_2\}$;
    $A_{a_1,a_2} \leftarrow \{x : \hat{I}_{x;a_1,a_2} \leq \hat{I}_k\}$;
end
$A_{best} \leftarrow \arg\min_{a_1,a_2} \hat{\mathcal{F}}(A_{a_1,a_2})$;
$B_{best} \leftarrow \{1, \dots, n\} \setminus A_{best}$;
return $[A_{best}, B_{best}]$;

Algorithm 4: Pseudocode for partitioning using the Anchors method



whether the item x belongs to A or B, since $I_{x;a_1,a_2}$ is nonzero in the
first case, and zero in the second case.

For finite training sets, when $I$ is only known approximately, one can sort
the set $\{\hat{I}_{x;a_1,a_2} : x \neq a_1, a_2\}$ and, if k is known, take
the k items closest to zero to be the set B (when k is unknown, one can use a
threshold to infer k). Since we compare all items against $a_1, a_2$, we refer
to these two fixed items as "anchors".

Of course $a_1, a_2$ are not known in advance, but by fixing $a_1$ to be an
arbitrary item, one can repeat the above method for all n - 1 settings of $a_2$
to produce a collection of $O(n^2)$ candidate partitions. Each partition can
then be scored using the approximate objective $\hat{\mathcal{F}}$, and a final
optimal partition can be selected as the minimum over the candidates. See
Algorithm 4. In cases when k is not known a priori, we evaluate partitions for
all possible settings of k using $\hat{\mathcal{F}}$.
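The anchoring procedure for known k can be sketched as below. The code is an illustration under assumptions rather than a definitive implementation: `mi` and `score` are hypothetical callables standing in for the cached mutual information estimates and the estimated objective, items are 0-indexed, and, following the pseudocode listing, the k items with the smallest estimates form each candidate set.

```python
def anchors_partition(n, k, mi, score, a1=0):
    """Sketch of the Anchors method for known k.

    mi(x, a1, a2): estimate of the mutual information between the rank of x
                   and the relative order of the anchors (hypothetical).
    score(A): estimated objective at candidate set A (hypothetical).
    a1: the arbitrarily fixed first anchor.
    """
    candidates = []
    for a2 in range(n):
        if a2 == a1:
            continue
        others = [x for x in range(n) if x not in (a1, a2)]
        # Items with the smallest estimates are least tied to the anchors.
        others.sort(key=lambda x: mi(x, a1, a2))
        candidates.append(frozenset(others[:k]))
    best = min(candidates, key=score)
    return best, frozenset(range(n)) - best
```

With idealized inputs (estimates that are positive exactly when x and both anchors lie in the same set), the candidate produced by a correctly placed second anchor recovers the hidden partition, and the scoring step selects it.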

Since the Anchors method does not require searching over subsets, it can
be significantly faster than an exhaustive optimization of
$\hat{\mathcal{F}}$. Moreover, by assuming $\epsilon$-third-order strong
connectivity as in the previous section, one can use similar arguments to
derive sample complexity bounds.

Corollary 35 (of Theorem 33). Let A and B be $\epsilon$-third-order strongly
connected riffle independent sets, and suppose $|A| = k$. Given
$S(\Delta, \epsilon)$ i.i.d. samples, the output of the Anchors algorithm is
exactly [A, B] with probability $1 - \gamma$. In particular, the Anchors
estimator is consistent.

We remark, however, that there are practical differences that can at times
make the Anchors method somewhat less robust than an exhaustive search.
Conceptually, anchoring works well when there exist two elements that are
strongly connected with all of the other elements in their set, which can then
be used as the anchor elements $a_1, a_2$. An exhaustive search can work well
under weaker conditions, such as when items are strongly connected through
longer paths. We show in our experiments that the Anchors method can
nonetheless be quite effective for learning hierarchies.

8.3. Running time. We now consider the running time of our structure
learning procedures. In both cases, it is necessary to precompute the mutual
information quantities $I_{i;j,k}$ for all triplets from m samples. For each
triplet, we can compute $I_{i;j,k}$ in linear time with respect to the sample
size. The set of all triplets can therefore be computed in $O(mn^3)$ time.

The exhaustive method for finding the k-subset which minimizes
$\hat{\mathcal{F}}$ requires evaluating the objective function at
$\binom{n}{k} = O(n^k)$ subsets. What is the complexity of evaluating
$\hat{\mathcal{F}}$ at a particular partition A, B? We need to sum the
precomputed mutual informations over the number of triangles that cross
between A and B. If $|A| = k$ and $|B| = n - k$, then we can bound the number
of such triangles by $k(n-k)^2 + k^2(n-k) = O(kn^2)$. Thus, we require
$O(kn^{k+2})$ optimization time, leading to a bound of $O(kn^{k+2} + mn^3)$
total time.
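For reference, the triangle count and the resulting optimization cost can be worked out explicitly; the short derivation below uses only the quantities named in the paragraph above.

```latex
\begin{align*}
k(n-k)^2 + k^2(n-k) &= k(n-k)\bigl[(n-k) + k\bigr] = nk(n-k) \le kn^2, \\
\binom{n}{k} \cdot O(kn^2) &= O(n^k) \cdot O(kn^2) = O(kn^{k+2}).
\end{align*}
```

Adding the $O(mn^3)$ cost of precomputing the mutual informations gives the stated total.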

The Anchors method requires us to (again) precompute mutual informations.
The other seeming bottleneck is the last step, in which we must evaluate
the objective function $\hat{\mathcal{F}}$ at $O(n^2)$ partitions. In reality,
if $|A|$ and $|B|$ are both larger than 1, then $a_1$ can be held fixed at any
arbitrary element, and we must only optimize over $O(n)$ partitions. When
$|A| = |B| = 1$, then n = 2, in which case the two sets are trivially riffle
independent (independent of the actual distribution). As we showed in the
previous paragraph, evaluating $\hat{\mathcal{F}}$ requires $O(kn^2)$ time, and
thus optimization using the Anchors method requires $O(n^3(k + m))$ total time.
Since k is much smaller than m (in any meaningful training set), we can drop
it from the big-O notation to get $O(mn^3)$ time complexity, showing that the
Anchors method is dominated by the time that is required to precompute and
cache mutual informations.

9. Structure discovery III: quantifying stability. Given a hierar- 
chy estimated from data, we now discuss how one might practically quantify 
how confident we should be about the hypothesized structure. We might like 
to know if the amount of data that was used for estimating the structure 
was adequate to support the learned structure, and, if the data looked
slightly different, would the hypothesis change?

Bootstrapping [Efron and Tibshirani, 1993] offers a simple approach — 
repeatedly resample the data with replacement, and estimate a hierarchi- 
cal structure for each resampling. The difference between our setting and 
typical bootstrapping settings, however, is that our structures lie in a large 
Fig 10. We show the distribution of structures estimated from bootstrapped samples of the
APA data (with varying sample sizes): (a) plots (in solid red) the fraction of bootstrapped
trees for each sample size which agree exactly with the hierarchy given in Figure 8; in (b),
we summarize the bootstrap distribution for the largest sample sizes.



discrete set. Thus, unlike continuous parameters, whose confidence we can 
often summarize with intervals or ellipses, it is not clear how one might 
compactly summarize a collection of many hierarchical clusterings of items. 

The simplest way to summarize the collection of hierarchies obtained via 
the bootstrap is to measure the fraction of the estimated structures which are 
identical to the structure estimated from the original unperturbed dataset. 
If, for small sets of resampled data, the estimated hierarchy is consistently 
identical to that obtained from the original data, then we can be confident 
that the data supports the hypothesis. We show in the following example 
that, for the structure which was learned from the APA dataset, a far smaller 
dataset would have sufficed. 
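The exact-agreement measure can be sketched as follows. The function below is an illustration under assumptions: `learn` is a hypothetical callable mapping a list of rankings to a hashable, comparable description of the learned hierarchy, and the default of 200 resamples mirrors the setting used in our examples.

```python
import random


def bootstrap_agreement(data, learn, sample_size, trials=200, seed=0):
    """Fraction of bootstrap resamples whose learned structure equals the
    structure learned from the full dataset.

    data: list of rankings.
    learn: callable mapping a list of rankings to a structure that supports
           equality comparison (hypothetical interface).
    """
    rng = random.Random(seed)
    reference = learn(data)
    hits = 0
    for _ in range(trials):
        # Resample with replacement at the requested sample size.
        resample = [rng.choice(data) for _ in range(sample_size)]
        hits += learn(resample) == reference
    return hits / trials
```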

Example 36 (APA Election data (continued)). As our final APA related 
example, we show the results of bootstrap resampling in Figure 10. To gener- 
ate the plots, we resampled the APA dataset with replacement 200 times each 
for varying sample sizes, and ran our Anchors algorithm on each resulting 
sample. Figure 10(a) plots (in solid red) the fraction of bootstrapped trees for 
each sample size which agree exactly with the hierarchy given in Figure 8. 
Given that we forced sets to be partitioned until they had at most 2 items, 
there are 120 possible hierarchical structures for the APA dataset. 

It is interesting to see that the hierarchies returned by the algorithm are 
surprisingly stable even given fewer than 100 samples, with about 25% of 
bootstrapped trees agreeing with the optimal hierarchy. At 1000 samples, al- 
most all trees agree with the optimal hierarchy. In Figure 10(b), we show a 
table of the bootstrap distribution for the largest sample sizes (which were 
concentrated at only a handful of trees).

For larger item sets n, however, it is rarely the case that there is enough 
data to strongly support the hierarchy in terms of the above measure. In 
these cases, instead of asking whether entire structures agree with each 
other exactly, it makes sense to ask whether estimated substructures agree.
For example, a simple measure might amount to computing the fraction of 
structures estimated from resampled datasets which agreed with the original 
structure at the topmost partition. Another natural measure is to count the 
fraction of structures which correctly recovered all (or a subset of) leaf sets 
for the original dataset, but not necessarily the correct hierarchy. By Propo- 
sition 23, correctly discovering the leaf set partitioning is probabilistically 
meaningful, and corresponds to correctly identifying the d-way decomposition
corresponding to a distribution, but failing to identify the specific
hierarchy.
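These substructure measures are simple to compute once a representation for hierarchies is fixed. The sketch below assumes, purely for illustration, that a hierarchy is either a `frozenset` of items (a leaf) or a pair `(left, right)` of sub-hierarchies; the helper names are hypothetical.

```python
def leaf_sets(tree):
    """Set of leaf item sets of a hierarchy (leaf = frozenset, node = pair)."""
    if isinstance(tree, frozenset):
        return {tree}
    left, right = tree
    return leaf_sets(left) | leaf_sets(right)


def items(tree):
    """All items appearing anywhere in the hierarchy."""
    return frozenset().union(*leaf_sets(tree))


def top_partition(tree):
    """Unordered topmost split of the item set, or None for a single leaf."""
    if isinstance(tree, frozenset):
        return None
    left, right = tree
    return frozenset({items(left), items(right)})
```

Two hierarchies can then agree on `leaf_sets` while disagreeing on `top_partition`, which is exactly the case of recovering the leaf sets but not the hierarchy.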

We remark that sometimes, there is no one unique structure correspond- 
ing to a distribution. The uniform distribution, for example, is consistent 
with any hierarchical riffle independent structure, and so bootstrapped hier- 
archies will not concentrate on any particular structure or even substructure. 
Moreover, even when there is a true unique structure corresponding to the
generating distribution, it may be the case that other simpler structures
perform better when there is not much available training data.

10. Related work. Our work draws from several literatures: card
shuffling research due primarily to Persi Diaconis and collaborators [Bayer and
Diaconis, 1992; Fulman, 1998], papers about Fourier theoretic probabilistic
inference over permutations from the machine learning community [Huang,
Guestrin and Guibas, 2009b; Huang et al., 2009; Kondor, 2008], as well as
graphical model structure learning research.

10.1. Card shuffling theory. Bayer and Diaconis [1992] provided a
convergence analysis of repeated riffle shuffles. Our novelty lies in the
combination of shuffling theory with independence, which was first exploited in
Huang et al. [2009], for scaling inference operations to large problems.
Finally, we remark that Fulman [1998] introduced a class of shuffles known
as biased riffle shuffles which are not the same as the biased riffle shuffles
discussed in our paper. The fact that the uniform riffle shuffle can be
realized by dropping cards with probability proportional to the number of cards
remaining in each hand has been observed in a number of papers [Bayer and
Diaconis, 1992], but we are the first to (1) formalize this in the form of the
recurrence given in Equation D.1, and (2) compute the Fourier transform
of the uniform and biased riffle shuffling distributions.
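The card-dropping description of the uniform riffle shuffle translates directly into a sampler. The following is a minimal sketch (the function name and the 'A'/'B' labeling are illustrative choices): at each step a card is dropped from a hand with probability proportional to the number of cards remaining in that hand, which makes every interleaving of the two hands equally likely.

```python
import random


def uniform_interleaving(p, q, rng=random):
    """Sample an interleaving of a hand of p cards with a hand of q cards.

    Returns a list of p 'A' labels and q 'B' labels; position t holds the
    hand that supplies the card at position t of the combined deck.
    """
    out = []
    while p + q > 0:
        # Drop from hand A with probability p / (p + q).
        if rng.random() < p / (p + q):
            out.append("A")
            p -= 1
        else:
            out.append("B")
            q -= 1
    return out
```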

10.2. Fourier analysis on permutations. Our dynamic programming ap- 
proach bears some similarities to the FFT (Fast Fourier Transform) algo- 
rithm proposed by Clausen and Baum [1993], and in particular, relies on 
the same branching rule recursions [Sagan, 2001]. While the Clausen FFT 
requires $O(n! \log(n!))$ time, since our biased riffle shuffles are parameterized
by a single $\alpha$, we can use the recurrence to compute low-frequency Fourier
terms in polynomial time. 

10.3. Learning structured representations. Our insights for the structure 
learning problems are inspired by some of the recent approaches in the ma- 
chine learning literature for learning the structure of thin junction trees [Bach 
and Jordan, 2001]. In particular, the idea of using a low order proxy objec- 
tive with a graph-cut like optimization algorithm is similar to an idea which 
was recently introduced in Shahaf, Chechetka and Guestrin [2009], which 
determines optimally thin separators with respect to the Bethe free energy 
approximation (of the entropy) rather than a typical log-likelihood objective. 
Our sample analysis is based on the mutual information sample complexity 
bounds derived in Hoffgen [1993], which was also used in Chechetka and 
Guestrin [2007] for developing a structure learning algorithm for thin junc- 
tion trees with provably polynomial sample complexity. Finally, the boot- 
strap methods which we have employed in our experiments for verifying 
robustness bear much resemblance to some of the common bootstrapping 
methods which have been used in bioinformatics for analyzing phylogenetic 
trees [Holmes, 1999, 2003]. 

11. Experiments. In this section, we present a series of experiments to 
validate our models and methods. All experiments were implemented in Mat- 
lab, except for the Fourier theoretic routines, which were written in C++. We 
tested on lab machines with two AMD quad-core Opteron 2.7GHz processors
with 32 GB of memory. We have already analyzed the APA data extensively
throughout the paper. Here, we demonstrate our algorithms on simulated 
data as well as other real datasets, namely, sushi preference data, and Irish 
election data. 

11.1. Simulated data. We begin with a discussion of our simulated data 
experiments. We first consider approximation quality and timing issues for 
a single binary partition of the item set. 
Fig 11. Synthetic data experiments for a single partitioning of the item set



Binary partitioning of the item set. To understand the behavior of
RiffleSplit in approximately riffle independent situations, we drew sample sets
of varying sizes from a riffle independent distribution on $S_8$ (with bias
parameter $\alpha = .25$) and used RiffleSplit to estimate the relative ranking
factors and interleaving distribution from the empirical distribution. In
Figure 11(a), we plot the KL-divergence between the true distribution and that
obtained by applying RiffleJoin to the estimated riffle factors. With small
sample sizes (far less than 8! = 40320), we are able to recover accurate
approximations despite the fact that the empirical distributions are not
exactly riffle independent. For comparison, we ran the experiment using the
Split algorithm [Huang et al., 2009] to recover the parameters. Perhaps
surprisingly, one can show that the Split algorithm from Huang et al. [2009] is
also an unbiased, consistent estimator of the riffle factors, but it does not
return the maximum likelihood parameter estimates because it effectively
ignores rankings which are not contained in the subgroup $S_p \times S_q$.
Consequently, our RiffleSplit algorithm converges to the correct parameters
with far fewer samples.

Next, we show that our Fourier domain algorithms are capable of han- 
dling sizeable item sets (with size n) when working with low-order terms. 
In Figure 11(b) we ran our Fourier domain RiffleJoin algorithm on various 
simulated distributions. We plot running times of RiffleJoin (without precom- 
puting the interleaving distributions) as a function of n (setting p = 
which is the worst case) scaling up to n = 40. 

Learning a hierarchy of items. We next applied our methods to synthetic
data to show that, given enough samples, our algorithms do effectively
recover the optimal hierarchical structures which generated the original
datasets. For various settings of n, we simulated data drawn jointly from a
k-thin chain



Fig 12. Structure discovery experiments on synthetic data. (a) Success rate for structure
recovery vs. sample size (n = 16, k = 4); (b) number of samples required for structure
recovery vs. number of items n; (c) test set log-likelihood comparison; (d) Anchors
algorithm success rate (n = 16, unknown k)

model (for k = 4) with a random parameter setting for each structure and
applied our exact method for learning thin chains to each sampled dataset.
First, we investigated the effect of varying sample size on the proportion of
trials (out of fifty) for which our algorithms were able to (a) recover the full
underlying tree structure exactly, (b) recover the topmost partition correctly,
or (c) recover all leaf sets correctly (but possibly out of order). Figure 12(a)
shows the result for an item set of size n = 16. Figure 12(b) shows, as a
function of n, the number of samples that were required in the same
experiments to (a) exactly recover the full underlying structure or (b) recover
the correct leaf sets, for at least 90% of the trials. What we can observe from
the plots is that, given enough samples, reliable structure recovery is indeed
possible. It is also interesting to note that recovery of the correct leaf sets
can be done with far fewer samples than are required for recovering the
full hierarchical structure of the model.

After learning a structure for each dataset, we learned model parameters 
 1. ebi (shrimp)                 2. anago (sea eel)      3. maguro (tuna)
 4. ika (squid)                  5. uni (sea urchin)     6. sake (salmon roe)
 7. tamago (egg)                 8. toro (fatty tuna)    9. tekka-maki (tuna roll)
10. kappa-maki (cucumber roll)

Fig 13. List of sushi types in the Kamishima [2003] dataset



and evaluated the log-likelihood of each model on 200 test examples drawn
from the true distributions. In Figure 12(c), we compare log-likelihood
performance when (a) the true structure is given (but not parameters), (b) a
k-thin chain is learned with known k, and (c) when we use a randomly
generated 1-chain structure. As expected, knowing the true structure results in
the best performance, and the 1-chain is overconstrained. However, our
structure learning algorithm is eventually able to catch up to the performance
of the true structure given enough samples. It is also interesting to note that
the jump in performance at the halfway point in the plot coincides with the
jump in the success rate of discovering all leaf sets correctly; we conjecture
that performance is sometimes less sensitive to the actual hierarchy used, as
long as the leaf sets have been correctly discovered.

To test the Anchors algorithm, we ran the same simulation using Algo- 
rithm 4 on data drawn from hierarchical models with no fixed k. We gen- 
erated roughly balanced structures, meaning that item sets were recursively 
partitioned into (almost) equally sized subsets at each level of the hierarchy. 
From Figure 12(d), we see that the Anchors algorithm can also discover the 
true structure given enough samples. Interestingly, the difference in sample 
complexity for discovering leaf sets versus discovering the full tree is not 
nearly as pronounced as in Figure 12(a). We believe that this is due to the 
fact that the balanced trees have less depth than the thin chains, leading to 
fewer opportunities for our greedy top-down approach to commit errors. 

11.2. Data analysis: sushi preference data. We now turn to analyzing 
real datasets. For our first analysis, we examine a sushi preference ranking 
dataset [Kamishima, 2003] consisting of 5000 full rankings of ten types of 
sushi. The items are enumerated in Figure 13. Note that, compared to the 
APA election data, the sushi dataset has twice as many items, but fewer 
examples. 

We begin by studying our methods in the case of a single binary parti- 
tioning of the item set. Unlike the APA dataset, there is no obvious way 
to naturally partition the types of sushi into two sets — in our first set of 
experiments, we have arbitrarily divided the item set into A = {1, . . . , 5}
and B = {6, . . . , 10}.
Fig 14. Sushi preference ranking experiments



We divided the data into training and test sets (with 500 examples) and
estimated the true distribution in three ways: (1) directly from samples (with
regularization), (2) using a riffle independent distribution (split evenly into
two groups of five, as mentioned above) with the optimal shuffling
distribution m, and (3) with a biased riffle shuffle (and optimized bias
$\alpha$). Figure 14(a) plots test set log-likelihood as a function of training
set size; we see that riffle independence assumptions can help significantly
to lower the sample complexity of learning. Biased riffle shuffles, as can
also be seen, are a useful learning bias with very small samples.

As an illustration of the behavior of biased riffle shuffles, see Figure 14(b) 
which shows the approximate first-order marginals of Uni (Sea Urchin) rank- 
ings, and the biased riffle approximation. The Uni marginals are interesting, 
because while many people like Uni, thus providing high rankings, many peo- 
ple also hate it, providing low rankings. The first-order marginal estimates 
have significant variance at low sample sizes, but with the biased riffle ap- 
proximation, one can achieve a reasonable approximation to the distribution 
even with few samples at the cost of being somewhat oversmoothed. 
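For readers who wish to reproduce such plots, the empirical first-order marginals can be tabulated directly from the rankings. The sketch below assumes each ranking is a list `sigma` with `sigma[item]` the 0-indexed rank of that item, and stores ranks along rows and items along columns; both conventions are choices made for this example.

```python
def first_order_marginals(rankings, n):
    """Empirical first-order marginals of a sample of rankings.

    Returns an n-by-n list of lists M with M[rank][item] equal to the
    fraction of rankings that place `item` at position `rank`.
    """
    M = [[0.0] * n for _ in range(n)]
    weight = 1.0 / len(rankings)
    for sigma in rankings:
        for item, rank in enumerate(sigma):
            M[rank][item] += weight
    return M
```

Each row and each column of the resulting matrix sums to one, since every ranking assigns each item exactly one rank and each rank exactly one item.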

Structure learning on the sushi dataset. Figure 16(b) shows the hierarchical
structure that we learn using the entire sushi dataset. Since the sushi are
not prepartitioned into distinct coalitions, it is somewhat more difficult than
with, say, the APA data, to interpret whether the estimated structure makes
sense. However, parts of the tree certainly seem like reasonable groupings.
For example, all of the tuna related sushi types have been clustered together.
Tamago and kappa-maki (egg and cucumber rolls) are "safer", typically more
boring choices, while uni and sake (sea urchin and salmon roe) are more
daring. Anago (sea eel) is the odd man out in the estimated hierarchy,
being partitioned away from the remaining items at the top of the tree.

Fig 15. Sushi preference dataset: exact first-order marginals and riffle independent
approximation

Fig 16. Structure discovery experiments: Sushi preference dataset. (a) Stability of
bootstrapped tree 'features' of the sushi dataset; (b) learned hierarchy for sushi dataset
using all 5000 rankings

To understand the behavior of our algorithm with smaller sample sizes,
we looked for features of the tree from Figure 16(b) which remained stable
even when learning with smaller sample sizes. Figure 16(a) summarizes the
results of our bootstrap analysis for the sushi dataset, in which we resample
from the original training set 200 times at each of several sample sizes
and plot the proportion of learned hierarchies which (a) recover 'sea eel'
as the topmost partition, (b) recover all leaf sets correctly, (c) recover the
entire tree correctly, (d) recover the tuna-related sushi leaf set, (e) recover
the {tamago, kappa-maki} leaf set, and (f) recover the {uni, sake} leaf set.
(a) The Meath constituency in Ireland, shown in green, was one of three
constituencies to have electronic voting in 2002. (Map from Wikipedia)

      Candidate         Party
  2   Bruton, J.        Fine Gael
  3   Colwell, J.       Independent
  4   Dempsey, N.       Fianna Fail
  5   English, D.       Fine Gael
  6   Farrelly, J.      Fine Gael
  7   Fitzgerald, B.    Independent
  8   Kelly, T.         Independent
  9   O'Brien, P.       Independent
 10   O'Byrne, F.       Green Party
 11   Redmond, M.       Christian Solidarity
 12   Reilly, J.        Sinn Fein
 13   Wallace, M.       Fianna Fail
 14   Ward, P.          Labour

(b) List of candidates from the Meath constituency election in 2002 for five
seats in the Dail Eireann (reproduced from Gormley and Murphy [2006])

Fig 17. Irish election dataset summary



11.3. Data analysis: Irish election data. We next applied our algorithms 
to a larger Irish House of Parliament (Dail Eireann) election dataset from 
the Meath constituency in Ireland (Figure 17(a)). The Dail Eireann uses the 
single transferable vote (STV) election system, in which voters rank a subset 
of candidates. In the Meath constituency, there were 14 candidates in the 
2002 election, running for five allotted seats. The candidates identified with 
the two major rival political parties, Fianna Fail and Fine Gael, as well as a 
number of smaller parties (Figure 17(b)). See Gormley and Murphy [2006] for 
more election details (including candidate names) as well as an alternative 
analysis. In our experiments, we used a subset of roughly 2500 fully ranked 
ballots from the election. 

To summarize the dataset, Figure 18(a) shows the matrix of first-order
marginals estimated from the dataset. Candidates {1, 2, 4, 5, 6, 13} form the
set of "major" party candidates belonging to either Fianna Fail or Fine Gael,
and as shown in the figure, fared much better in the election than the other
eight minor party candidates. Notably, candidates 11 and 12 (belonging to
the Christian Solidarity Party and Sinn Fein, respectively) received on average
the lowest ranks in the 2002 election. One of the differences between the two
candidates, however, is that a significant portion of the electorate also ranked
the Sinn Fein candidate very high.
Fig 18. Irish Election dataset: exact first-order marginals and riffle independent
approximations



Fig 19. Learned hierarchy for Irish Election dataset using all 2500 ballots. From the root,
{1, . . . , 14} splits into {12} (Sinn Fein) and {1, . . . , 11, 13, 14}; the latter splits into
{11} (Christian Solidarity) and {1, . . . , 10, 13, 14}; that set splits into {1, 4, 13} and
{2, 3, 5, 6, 7, 8, 9, 10, 14}. Deeper nodes include {14} (Labour) and leaves labeled
Independent and Green.



Though it may not necessarily be clear how one might partition the
candidates, a natural idea might be to assume that the major party candidates (A)
are riffle independent of the minor party candidates (B). In Figure 18(b), we
show the first-order marginals corresponding to an approximation in which
A and B are assumed to be riffle independent. Visually, the approximate
first-order marginals can be seen to be roughly similar to the exact
first-order marginals; however, there are significant features of the matrix
which are not captured by the approximation. For example, the columns
belonging to candidates 11 and 12 are not well approximated. In Figure 18(c), we
plot a more principled approximation corresponding to a learned hierarchy,
which we discuss next. As can be seen, the first-order marginals obtained via
structure learning are visually much closer to the exact marginals.

Fig 20. Structure Discovery Experiments: Irish Election dataset

Structure discovery on the Irish election data. As with the APA data, both 
the exhaustive optimization of $\hat{\mathcal{F}}$ and the Anchors algorithm returned the
same tree, with running times of 69.7 seconds and 2.1 seconds respectively 
(not including the 3.1 seconds required for precomputing mutual informa- 
tions). The resulting tree, with candidates enumerated alphabetically from 
1 through 14, is shown (only up to depth 4), in Figure 19. As expected, the 
candidates belonging to the two major parties, Fianna Fail and Fine Gael, 
are neatly partitioned into their own leaf sets. The topmost leaf is the Sinn 
Fein candidate, indicating that voters tended to insert him into the ranking 
independently of all of the other 13 candidates. 

To understand the behavior of our algorithm with smaller sample sizes, 
we looked for features of the tree from Figure 19 which remained stable even 
when learning with smaller sample sizes. In Figure 20(a), we resampled from 
the original training set 200 times at different sample sizes and plot the 
proportion of learned hierarchies which (a) recover the Sinn Fein candidate
as the topmost leaf, (b) partition the two major parties into leaf sets,
(c) agree with the original tree on all leaf sets, and (d) recover the entire
tree. Note that while the dataset is insufficient to support the entire tree 
structure, even with about 100 training examples, candidates belonging to 
the major parties are consistently grouped together indicating strong party 
influence in voting behavior. 

We compared the results between learning a general hierarchy (without 
fixed k) and learning a 1-thin chain model on the Irish data. Figure 20(b) 
shows the log-likelihoods achieved by both models on a held-out test set as 
the training set size increases. For each training set size, we subsampled the 
Fig 21. Learned hierarchies for Irish election data from Dublin (north) and Dublin (west)
constituencies. Panel (b): Dublin West



Irish dataset 100 times to produce confidence intervals. Again, even with
small sample sizes, the hierarchy outperforms the 1-chain and continually
improves with more and more training data. One might think that the
hierarchical models, which use more parameters, are prone to overfitting, but
in practice, the models learned by our algorithm devote most of the extra
parameters towards modeling the correlations among the two major parties.
As our results suggest, such intraparty ranking correlations are crucial for 
achieving good modeling performance. 

Finally, we ran our structure learning algorithm on two similar but smaller 
election datasets from the other constituencies in the 2002 election which 
supported electronic voting, the Dublin North and West constituencies. Fig- 
ure 21 shows the resulting hierarchies learned from each dataset. As with the 
Meath constituency, the Fianna Fail and Fine Gael are consistently grouped 
together in leaf sets in the Dublin datasets. Interestingly, the Sinn Fein and 
Socialist parties are also consistently grouped in the Dublin datasets, poten- 
tially indicating some latent similarities between the two parties. 

12. Conclusions. Exploiting independence structure for efficient inference
and low sample complexity is a simple yet powerful idea, pervasive
throughout the machine learning literature, showing up in the form of
Bayesian networks, Markov random fields, and more. For rankings, independence
can be problematic due to mutual exclusivity constraints, and we began
our paper by indicating a need for a useful generalization of independence.

The main contribution of our paper is the definition of such a generalized
notion, namely, riffled independence. There are a number of natural questions
that immediately follow any such definition, such as:

• Does the generalization retain any of the computational advantages of
probabilistic independence?

• Can we find evidence that such generalized independence relations hold 
(or approximately hold) in real datasets? 

• If subsets of items in a ranking dataset indeed satisfy the generalized 
independence assumption, or approximately so, how could we algorith- 
mically determine what these subsets should be from samples? 

We have shown that for riffled independence, the answer to each of the 
above questions lies in the affirmative. We next explored hierarchical riffle 
independent decompositions. Our model, in which riffle independent subsets 
are recursively chained together, leads to a simple, interpretable model whose 
structure we can estimate from data, and we have successfully applied our 
learning algorithms to several real datasets. 

Currently, the success of our structure learning methods depends on the 
existence of a fairly sizeable dataset of full rankings. However, ranking datasets 
are more typically composed of partial or incomplete rankings, which are of- 
ten far easier to elicit from a multitude of users. For example, top-k type 
rankings, or even rating data (in which a user/judge provides a rating of 
an item between, say, 1 and 5) are common. Extending our parameter and 
structure learning algorithms for handling such partially ranked data would 
be a valuable and practical extension of our work. For structure learning,
our tripletwise mutual information measures can already potentially be
estimated within a top-k ranking setting. It would be interesting to also develop
methods for estimating these mutual information measures from other forms
of partial rankings. Additionally, the effect of using partial rankings on struc- 
ture learning sample complexity is not yet understood, and the field would 
benefit from a careful analysis. 

Many other extensions are possible. In our paper, we have developed
algorithms for estimating maximum likelihood parameters. For small
training set sizes, a Bayesian approach would be more appropriate, where 
a prior is placed on the parameter space. However, if the prior distribution 
ties parameters together (i.e., if the prior does not factor across parameters), 
then the structure learning problem can be considerably more complicated, 
since we would not be able to simply identify independence relations. 

Riffled independence is a new tool for analyzing ranked data and as we 
have shown, has the potential to give new insights into ranking datasets. 
We strongly believe that it will be crucial in developing fast and efficient 
inference and learning procedures for ranking data, and perhaps other forms 
of permutation data. 
APPENDIX B: MORE PROOFS AND DISCUSSION
B.1. Proof of Lemma 14.

Proof. Let $i$ be an item in $A$ (with $1 \le i \le p$). Since $\phi_A(\sigma) \in S_p$,
$[\phi_A(\sigma)](i)$ is some number between 1 and $p$; below we abbreviate it as
$\phi_A(i)$. By definition, for any $j \in \{1, \dots, p\}$ the interleaving map
$[\tau_{A,B}(\sigma)](j)$ returns the $j$-th largest rank in $\sigma(A)$. Thus,
$[\tau_{A,B}(\sigma)](\phi_A(i))$ is the $\phi_A(i)$-th largest rank in $\sigma(A)$, which
is simply the absolute rank of item $i$. Therefore, we conclude that
$\sigma(i) = [\tau_{A,B}(\sigma)](\phi_A(i))$. Similarly, if $p+1 \le i \le n$, we have
$\sigma(i) = [\tau_{A,B}(\sigma)](\phi_B(i) + p)$ (the added $p$ is necessary since the
indices of $B$ are offset by $p$ in $\sigma$), and we can conclude that
$\sigma = \tau_{A,B}(\sigma)\,[\phi_A(\sigma)\ \phi_B(\sigma)]$. □
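
The decomposition used in this proof can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's implementation: a ranking is stored as a list `sigma` with `sigma[i]` the rank of item `i + 1`, the set A is taken to be the first `p` items, and the "j-th" rank within a set is read off in increasing order. The function names `decompose` and `recompose` are hypothetical.

```python
from itertools import permutations


def decompose(sigma, p):
    """Split a ranking into an interleaving and two relative rankings.

    sigma[i] is the (1-based) rank of item i + 1; A is the first p items.
    """
    ranks_a = sorted(sigma[:p])          # absolute ranks occupied by A
    ranks_b = sorted(sigma[p:])          # absolute ranks occupied by B
    tau = ranks_a + ranks_b              # interleaving: slot j -> absolute rank
    phi_a = [ranks_a.index(r) + 1 for r in sigma[:p]]  # relative ranks in A
    phi_b = [ranks_b.index(r) + 1 for r in sigma[p:]]  # relative ranks in B
    return tau, phi_a, phi_b


def recompose(tau, phi_a, phi_b):
    """Rebuild sigma via sigma(i) = tau(phi_A(i)) on A and tau(phi_B(i) + p) on B."""
    p = len(phi_a)
    return [tau[r - 1] for r in phi_a] + [tau[p + r - 1] for r in phi_b]


# Round trip over every ranking of 4 items with |A| = 2.
ok = all(recompose(*decompose(list(s), 2)) == list(s)
         for s in permutations(range(1, 5)))
```

The offset `p` in `recompose` plays the same role as the added $p$ in the proof: the slots for B start after the `p` slots for A.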

B.2. Log-likelihood interpretations. If we examine the KL divergence
objective introduced in Section 7, it is a standard fact that minimizing
Equation 7.1 is equivalent to finding the structure which maximizes the
log-likelihood of the training data:

$$\mathcal{F}[A,B] = D_{KL}\big(\hat{h}(\sigma) \,\big\|\, m(\tau_{A,B}(\sigma))\, f(\phi_A(\sigma))\, g(\phi_B(\sigma))\big)$$
$$= \text{const.} - \sum_{\sigma} \hat{h}(\sigma) \log\big(m(\tau_{A,B}(\sigma))\, f(\phi_A(\sigma))\, g(\phi_B(\sigma))\big).$$

In the above (with some abuse of notation), m, f and g are estimated using
counts from the training data, and the sum is, up to a positive scale factor,
the log-likelihood of the training data under the factored model. The
equivalence is significant because it justifies structure learning for data
which is not necessarily generated from a distribution which factors into
riffle independent components. Using a similar manipulation, we can rewrite
our objective function (Equation 7.2) in a form
which we can see to be a "composite" of two likelihood functions. We are
evaluating our data log-likelihood first under a model in which the absolute
ranks of items in A are independent of relative ranks of items in B, and
secondly under a model in which the absolute ranks of items in B are
independent of relative ranks of items in A. Here again, $\psi_A$ and $\psi_B$ are estimated
using counts of the training data. We see that if these distributions, $\psi_A$ and $\psi_B$,
factor along their inputs, then optimizing the objective function is
equivalent to optimizing the likelihood under the riffle independent model. Thus, if
the data is already riffle independent (or nearly riffle independent), then the
structure learning objective can indeed be interpreted as maximizing the
log-likelihood of the data, but otherwise there does not seem to be a clear
equivalence between the two objective functions.

B.3. Why testing for independence of relative ranks is insufficient.
Why can we not just check to see that the relative ranks of A are
independent of the relative ranks of B? Another natural objective function
for detecting riffle independent subsets is:

(B.1)    $\mathcal{F}[A,B] = I(\phi_A(\sigma);\, \phi_B(\sigma)).$

Having Equation B.1 equal zero is certainly a necessary condition for subsets
A and B to be riffle independent, but why would it not be sufficient? It is easy
to construct a counterexample: simply find a distribution in which the
interleaving depends on either of the relative rankings.

Example 37. In this example, we will consider a distribution on $S_4$.
Let $A = \{1,2\}$ and $B = \{3,4\}$. To generate rankings $\sigma \in S_4$, we will draw
independent relative rankings, $\sigma_A$ and $\sigma_B$, with uniform probability for each
of A and B. Then set the interleaving as follows:

$$\tau = \begin{cases} AABB & \text{if } \sigma_A = (1,2), \\ BBAA & \text{otherwise.} \end{cases}$$

Finally set $\sigma = \tau \cdot [\sigma_A, \sigma_B]$.

Since the relative rankings are independent, $\mathcal{F}[A,B] = 0$. But since the
interleaving depends on the relative ranking of items in A, we see that A and
B are not riffle independent in this example.
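
Example 37 is small enough to enumerate exactly. The following sketch is an illustration, not part of the original appendix: it lists the four equally likely outcomes of the construction and computes two mutual informations, one between the relative rankings and one between the relative ranking of A and the interleaving. The helper `mutual_information` is a hypothetical name and works in nats.

```python
from collections import Counter
from itertools import product
from math import log


def mutual_information(pairs):
    """Mutual information (nats) of equally likely (x, y) outcomes."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


outcomes = []
for sigma_a, sigma_b in product([(1, 2), (2, 1)], repeat=2):
    # The interleaving is chosen as a function of sigma_a, as in Example 37.
    tau = "AABB" if sigma_a == (1, 2) else "BBAA"
    outcomes.append((sigma_a, sigma_b, tau))

# Relative rankings of A and B are independent: this value is 0.
mi_relative = mutual_information([(a, b) for a, b, _ in outcomes])
# The interleaving is determined by sigma_a: this value is log 2.
mi_interleaving = mutual_information([(a, t) for a, _, t in outcomes])
```

The first quantity vanishes while the second does not, which is the failure of riffled independence that a test on relative ranks alone cannot see.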
APPENDIX C: MANIPULATING INTERLEAVING DISTRIBUTIONS IN THE FOURIER DOMAIN

APPENDIX D: PROOF OF RECURRENCE 

Theorem 38. Algorithm 1 returns a uniformly distributed (p, q)-interleaving.

Proof. The proof is by induction on $n = p + q$. The base case (when
$n = 1$) is obvious since the algorithm can only return a single permutation.

Next, we assume for the sake of induction that for any $m < n$, the
algorithm returns a uniformly distributed interleaving and we want to show this
to also be the case for $n$.

Let $\tau$ be any interleaving in $\Omega_{p,q}$. We will show that $m(\tau) = 1/\binom{n}{p}$.
Consider $\tau^- = \tau^{-1}(1 : n-1)$. There are two cases: $\tau^-$ is either a
$(p, q-1)$-interleaving (in which case $\tau(n) = n$), or a $(p-1, q)$-interleaving
(in which case $\tau(p) = n$).

We will just consider the first case since the second is similar. $\tau^-$ is
uniformly distributed by the inductive hypothesis and therefore has probability
$1/\binom{n-1}{p}$. $\tau^{-1}(n)$ is set to $n$ independently with probability $q/n$, so we
compute the probability of the interleaving resulting from the algorithm as:

$$\frac{n-p}{n} \cdot \frac{1}{\binom{n-1}{p}} = \frac{n-p}{n} \cdot \frac{p!\,(n-1-p)!}{(n-1)!} = \frac{p!\,(n-p)!}{n!} = \frac{1}{\binom{n}{p}}.$$

□
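
The induction can be mirrored by an exact computation. The sketch below is an assumption-laden illustration: Algorithm 1 itself appears only in the main paper, so the recursive draw is reconstructed from the proof (the last slot goes to B with probability q/n and to A with probability p/n), and an interleaving is encoded as a string of 'A' and 'B' characters. The function name `interleaving_probs` is hypothetical.

```python
from fractions import Fraction
from math import comb


def interleaving_probs(p, q):
    """Exact probability of every A/B pattern under the recursive draw."""
    if p == 0:
        return {"B" * q: Fraction(1)}
    if q == 0:
        return {"A" * p: Fraction(1)}
    n = p + q
    probs = {}
    # Last slot taken by an item of B: recurse on a (p, q - 1)-interleaving.
    for pattern, pr in interleaving_probs(p, q - 1).items():
        probs[pattern + "B"] = pr * Fraction(q, n)
    # Last slot taken by an item of A: recurse on a (p - 1, q)-interleaving.
    for pattern, pr in interleaving_probs(p - 1, q).items():
        probs[pattern + "A"] = pr * Fraction(p, n)
    return probs


dist = interleaving_probs(2, 3)
uniform = all(pr == Fraction(1, comb(5, 2)) for pr in dist.values())
```

Using `Fraction` keeps the check exact, so the uniformity test is an equality rather than a floating point comparison.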



Fourier transforming the biased riffle shuffle. We describe the recurrence
satisfied by $m^{unif}_{p,q}$, allowing one to write $m^{unif}_{p,q}$, a distribution on $S_n$, in
terms of $m^{unif}_{p,q-1}$ and $m^{unif}_{p-1,q}$, distributions over $S_{n-1}$ (see Algorithm 1 in
main paper). Given a function $f : S_{n-1} \to \mathbb{R}$, we will define the embedded
function $f\uparrow^{n}_{n-1} : S_n \to \mathbb{R}$ by $f\uparrow^{n}_{n-1}(\sigma) = f(\sigma_1, \dots, \sigma_{n-1})$ if $\sigma(n) = n$, and
0 otherwise. Algorithm 1 can then be rephrased as a recurrence relation as
follows.



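Proposition 39. The uniform riffle shuffling distribution $m^{unif}_{p,q}$ obeys
the recurrence relation:

(D.1)    $m^{unif}_{p,q} = \left(\frac{p}{p+q}\right) \left[m^{unif}_{p-1,q}\uparrow^{n}_{n-1}\right] * \delta_{(p+1,\dots,n)} + \left(\frac{q}{p+q}\right) m^{unif}_{p,q-1}\uparrow^{n}_{n-1},$

with base cases: $m^{unif}_{0,n} = m^{unif}_{n,0} = \delta_{\epsilon}$, where $\delta_{\epsilon}$ is the delta function at the
identity permutation.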
RiffleHat(p, q)
  n ← p + q ;
  Initialize \hat{m}^{prev} and \hat{m}^{curr} as arrays of p + 1 Fourier transform data structures ;
  for i = 1, 2, ..., n do
    for j = max(0, p − n + i), ..., min(i, p) do
      if j == 0 or j == i then
        \hat{m}^{curr}[j] ← Fourier transform of the delta function at the identity ;
      end
      else
        \hat{m}^{curr}[j] ← ((i − j)/i) Embed(\hat{m}^{prev}[j], i − 1, i)
                          + (j/i) Convolve(Embed(\hat{m}^{prev}[j − 1], i − 1, i), ρ(j + 1, ..., i)) ;
      end
    end
    \hat{m}^{prev} ← \hat{m}^{curr} ;
  end
  return \hat{m}^{curr}[p];

Algorithm 5: Pseudocode for computing the Fourier transform of the uniform riffle
shuffling distribution using dynamic programming.



Note that by taking the support sizes of each of the functions in the
above recurrence, we recover the following well known recurrence for binomial
coefficients:

(D.2)    $\binom{n}{p} = \binom{n-1}{p-1} + \binom{n-1}{p},$

with base case $\binom{n}{0} = \binom{n}{n} = 1$.
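
The same table-filling pattern can be illustrated on the support sizes alone. The sketch below is a simplified stand-in, not the Fourier-domain algorithm: it reuses the loop bounds of Algorithm 5 but stores integers (support sizes) instead of Fourier transform data structures, so each update is just the binomial recurrence. The function name `support_size` is hypothetical, and the code assumes $p + q \ge 1$.

```python
from math import comb


def support_size(p, q):
    """Number of (p, q)-interleavings, computed with a rolling table.

    Mirrors the loop structure of Algorithm 5, with integer support sizes
    standing in for the Fourier transforms. Assumes p + q >= 1.
    """
    n = p + q
    prev = [0] * (p + 1)
    for i in range(1, n + 1):
        curr = [0] * (p + 1)
        for j in range(max(0, p - n + i), min(i, p) + 1):
            if j == 0 or j == i:
                curr[j] = 1                      # base case: a single pattern
            else:
                curr[j] = prev[j] + prev[j - 1]  # binomial recurrence
        prev = curr                              # swap only after the row is done
    return prev[p]


sizes_match = all(support_size(p, q) == comb(p + q, p)
                  for p in range(7) for q in range(7) if p + q >= 1)
```

Only the entries with `max(0, p - n + i) <= j <= min(i, p)` are ever filled, which is the subset of the triangle that can still reach the target entry.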



The biased riffle shuffle is defined by:

(D.3)    $\cdots * \delta_{(p+1,\dots,n)} + \left(\frac{(1-\alpha)\,q}{p+q}\right) m^{unif}_{p,q-1}\uparrow^{n}_{n-1}.$

Writing the recursion in the form of Equation D.1 provides a construction
of the uniform riffle shuffle as a sequence of operations on smaller
distributions which can be performed completely with respect to Fourier coefficients.
In particular, given the Fourier coefficients of a function $f : S_{n-1} \to \mathbb{R}$, one
can construct the Fourier coefficients of the embedding $f\uparrow^{n}_{n-1}$ by applying
the branching rule (see Kondor and Borgwardt [2008]; Sagan [2001] for
details). Using the linearity property, the Convolution Theorem 20 and the fact
that embeddings can be performed in the Fourier domain, we arrive at the
equivalent Fourier-theoretic recurrence for each frequency level i.



(D.4)    $\left[\hat{m}^{unif}_{p,q}\right]_{\rho_i} = \left(\frac{p}{p+q}\right) \left[\widehat{m^{unif}_{p-1,q}\uparrow^{n}_{n-1}}\right]_{\rho_i} \cdot \rho_i(p+1, \dots, n) + \left(\frac{q}{p+q}\right) \left[\widehat{m^{unif}_{p,q-1}\uparrow^{n}_{n-1}}\right]_{\rho_i},$

where $\rho_i(p+1, \dots, n)$ is the $i$-th irreducible representation matrix evaluated at the
cycle $(p+1, \dots, n)$ (see Huang, Guestrin and Guibas [2009b] for details on
irreducible representations). Implementing the recurrence (Equation D.4) in
code can naively result in an exponential time algorithm. It is necessary to
use dynamic programming to avoid recomputing quantities that were already
computed. In Algorithm 5, we present pseudocode
of such a dynamic programming approach, which builds a 'Pascal's triangle'
similar to that which might be constructed to compute a table of binomial
coefficients. The pseudocode assumes the existence of Fourier domain
algorithms for convolving distributions and for embedding a distribution over
$S_{n-1}$ into $S_n$. See Figure 22 for a graphical illustration of the algorithm.

Fig 22. The flow of information in Algorithm 5 bears much resemblance to Pascal's
triangle for computing binomial coefficients. The arrows in this diagram indicate the Fourier
transforms that must be precomputed before computing the Fourier transform of a larger
interleaving distribution. For example, to compute $m_{1,2}$, one must first compute $m_{1,1}$ and
$m_{0,2}$. In blue, we have outlined the collection of Fourier transforms that are computed by
Algorithm 5 while computing $m_{1,3}$.

D.1. Sample complexity analysis.

Lemma 40 (adapted from Hoffgen [1993]). The entropy of a discrete
random variable with arity $R$ can be estimated to within accuracy $\Delta$ with
probability $1 - \beta$ using $O\left(\frac{R^2}{\Delta^2} \log^2 \frac{R}{\Delta} \log \frac{R}{\beta}\right)$ i.i.d. samples and the same time.

Lemma 41. The collection of mutual informations $I_{i;j,k}$ can be estimated
to within accuracy $\Delta$ for all triplets $(i,j,k)$ with probability at least $1 - \gamma$
using $S(\Delta, \gamma) \equiv O\left(\frac{n^2}{\Delta^2} \log^2 \frac{n}{\Delta} \log \frac{n^4}{\gamma}\right)$ i.i.d. samples and the same amount of
time.
time. 
Proof. Fix $0 < \gamma < 1$ and $\Delta$. For any fixed triplet $(i, j, k)$, Hoffgen's
result (Lemma 40) implies that $H(\sigma_i;\, \sigma_j < \sigma_k)$ can be estimated with
accuracy $\Delta$ with probability at least $1 - \gamma/n^3$ using $O\left(\frac{n^2}{\Delta^2} \log^2 \frac{n}{\Delta} \log \frac{n^4}{\gamma}\right)$
samples since the variable $(\sigma_i,\, \sigma_j < \sigma_k)$ has arity $2n$ and setting $\beta = \gamma/n^3$.
Estimating the mutual information for the same triplet therefore requires
the same sample complexity by the expansion: $I_{i;j,k} = H(\sigma_i) + H(\sigma_j < \sigma_k) - H(\sigma_i;\, \sigma_j < \sigma_k)$.
Now we use a simple union bound to bound the
probability that the collection of mutual informations over all triplets is
estimated to within $\Delta$ accuracy. Define $\Delta_{i;j,k} \equiv \hat{I}_{i;j,k} - I_{i;j,k}$.

$$P\big(|\Delta_{i;j,k}| < \Delta,\ \forall (i,j,k)\big) \ge 1 - \sum_{i,j,k} P\big(|\Delta_{i;j,k}| \ge \Delta\big) \ge 1 - n^3 \cdot \frac{\gamma}{n^3} \ge 1 - \gamma.$$

□
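
The entropy expansion used in this proof translates directly into a plug-in estimator. The sketch below is an illustration under stated assumptions, not the paper's estimator: each ranking is a sequence `r` with `r[x]` the rank of item `x` (0-based items), probabilities are replaced by empirical frequencies, and entropies are in nats. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log


def entropy(values):
    """Plug-in (empirical) entropy in nats of a list of hashable outcomes."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def triplet_mutual_information(rankings, i, j, k):
    """Plug-in estimate of I(sigma_i ; sigma_j < sigma_k).

    Uses I = H(sigma_i) + H(sigma_j < sigma_k) - H(sigma_i, sigma_j < sigma_k).
    """
    x = [r[i] for r in rankings]            # absolute rank of item i
    y = [r[j] < r[k] for r in rankings]     # does item j beat item k?
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Under the uniform distribution on S_3 the two variables are independent.
uniform_s3 = list(permutations((1, 2, 3)))
mi_uniform = triplet_mutual_information(uniform_s3, 0, 1, 2)
# With these two rankings, sigma_0 determines the comparison exactly.
mi_dependent = triplet_mutual_information([(1, 2, 3), (3, 2, 1)], 0, 1, 2)
```

The joint variable takes at most $2n$ values, which is the arity that enters the sample complexity bound.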

Lemma 42. Fix $k \le n/2$, and let A be a k-subset of $\{1, \dots, n\}$ with A
riffle independent of its complement B. Let $A'$ be a k-subset with $A' \ne A$ or
B. If A and B are each $\epsilon$-third order strongly connected, we have $\tilde{\mathcal{F}}(A') =
\tilde{\mathcal{F}}(B') > \psi(n, k) \cdot \epsilon$, where $\psi(n, k) \equiv (n - k)(n - 2k)$.

Proof. Let us first establish some notation. Given a subset $X \subset \{1, \dots, n\}$,
define

$$\Omega^{int}_{X} \equiv \{(x; y, z) : x, y, z \in X\}.$$

Thus $\Omega^{int}_{A}$ and $\Omega^{int}_{B}$ are the sets of triplets whose indices are all internal to A
or internal to B respectively. We define $\Omega^{cross}_{A',B'}$ to be the set of triplets which
"cross" between the sets $A'$ and $B'$:

$$\Omega^{cross}_{A',B'} \equiv \{(x; y, z) : x \in A',\ y, z \in B', \text{ or } x \in B',\ y, z \in A'\}.$$

The goal of this proof is to use the strong connectivity assumptions to
lower bound $\tilde{\mathcal{F}}(A')$. In particular, due to strong connectivity, each triplet
inside $\Omega^{cross}_{A',B'}$ that also lies in either $\Omega^{int}_{A}$ or $\Omega^{int}_{B}$ must contribute at least $\epsilon$ to
the objective function $\tilde{\mathcal{F}}(A')$. It therefore suffices to lower bound the number
of triplets which cross between $A'$ and $B'$, but are internal to either A or B
(i.e., $|\Omega^{cross}_{A',B'} \cap (\Omega^{int}_{A} \cup \Omega^{int}_{B})|$). Define $\ell \equiv |A \cap A'|$ and note that $0 \le \ell < k$.

It is straightforward to check that: $|A \cap B'| = k - \ell$, $|B \cap A'| = k - \ell$, and
$|B \cap B'| = (n - k) - (k - \ell) = n + \ell - 2k$.

$$|\Omega^{cross}_{A',B'} \cap (\Omega^{int}_{A} \cup \Omega^{int}_{B})| = |\Omega^{cross}_{A',B'} \cap \Omega^{int}_{A}| + |\Omega^{cross}_{A',B'} \cap \Omega^{int}_{B}|$$
$$\ge \ell(k - \ell)^2 + \ell^2(k - \ell) + (k - \ell)(n + \ell - 2k)^2 + (n + \ell - 2k)(k - \ell)^2$$
$$= (k - \ell)\big((n - k)(n - 2k) + \ell n\big).$$
We do not want the bound above to depend on $\ell$. Intuitively, for a fixed k and
n, the above expression is minimized when either $\ell = 0$ or $\ell = k - 1$ (a more
formal argument is shown below in the proof of Lemma 43). Plugging in $\ell = 0$
and $\ell = k - 1$ and bounding from below yields:

$$|\Omega^{cross}_{A',B'} \cap (\Omega^{int}_{A} \cup \Omega^{int}_{B})| \ge \min\big(k(n - k)(n - 2k),\ (n - k)(n - 2k) + n(k - 1)\big) \ge (n - k)(n - 2k).$$

Finally due to strong connectivity, we know that for each triplet in $\Omega^{int}_{A} \cup \Omega^{int}_{B}$
we have $I_{x;y,z} > \epsilon$, thus each edge in $\Omega^{cross}_{A',B'} \cap (\Omega^{int}_{A} \cup \Omega^{int}_{B})$ contributes at
least $\epsilon$ to $\tilde{\mathcal{F}}(A')$, establishing the desired result. □

Lemma 43. Under the same assumptions as Lemma 42, $p(n, k, \ell) = (k - \ell)\big((n - k)(n - 2k) + \ell n\big)$
is minimized at either $\ell = 0$ or $\ell = k - 1$.

Proof. Let $a = (n - k)(n - 2k)$. We know that $a \ge 0$ since $k \le n/2$
by assumption (and equals zero only when $k = n/2$). We want to find the
$\ell \in \{0, \dots, k - 1\}$ which minimizes the concave quadratic function
$p(\ell) = (k - \ell)(a + \ell n)$, the roots of which are $\ell = k$ and $\ell = -a/n$ (note that
$-a/n \le 0$). The minimizer is thus the element of $\{0, \dots, k - 1\}$ which is
closest to either of the roots. □
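
Both the counting identity from the proof of Lemma 42 and the endpoint claim of Lemma 43 can be checked by brute force over small cases. The sketch below is only a numerical sanity check, not a substitute for the proofs; the function names and the range of `n` tested are arbitrary choices made here.

```python
def crossing_count(n, k, l):
    """Four-term count of crossing triplets internal to A or B."""
    m = n + l - 2 * k                    # |B intersect B'|
    return (l * (k - l) ** 2 + l ** 2 * (k - l)
            + (k - l) * m ** 2 + m * (k - l) ** 2)


def p_bound(n, k, l):
    """Factored form (k - l)((n - k)(n - 2k) + l n) from Lemma 43."""
    return (k - l) * ((n - k) * (n - 2 * k) + l * n)


def check(max_n=30):
    for n in range(2, max_n + 1):
        for k in range(1, n // 2 + 1):
            values = [p_bound(n, k, l) for l in range(k)]
            # The two expressions agree for every admissible l.
            if any(crossing_count(n, k, l) != values[l] for l in range(k)):
                return False
            # The minimum over l is attained at l = 0 or l = k - 1.
            if min(values) != min(values[0], values[-1]):
                return False
            # The minimum is at least psi(n, k) = (n - k)(n - 2k).
            if min(values) < (n - k) * (n - 2 * k):
                return False
    return True
```

Running `check()` exercises every pair with $k \le n/2$ up to the chosen `max_n`.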

Theorem 44. Let A be a k-subset of $\{1, \dots, n\}$ with A riffle independent
of its complement B. If A and B are each $\epsilon$-third order strongly connected,
then given $S(\Delta, \epsilon) \equiv O\left(\frac{n^4}{\epsilon^2} \log^2 \frac{n^2}{\epsilon} \log \frac{n^4}{\gamma}\right)$ i.i.d. samples, the minimum of $\hat{\mathcal{F}}$
(evaluated over all k-subsets of $\{1, \dots, n\}$) is achieved at exactly the subsets
A and B with probability at least $1 - \gamma$.

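Proof. Let $A'$ be a k-subset with $A' \ne A$ or B. Our goal is to show that
$\hat{\mathcal{F}}(A') > \hat{\mathcal{F}}(A)$.

Denote the error between estimated mutual information and true mutual
information by $\Delta_{i;j,k} \equiv \hat{I}_{i;j,k} - I_{i;j,k}$. We have:

$$\hat{\mathcal{F}}(A') - \hat{\mathcal{F}}(A) = \sum_{(i,j,k) \in \Omega^{cross}_{A',B'}} \hat{I}_{i;j,k} - \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \hat{I}_{i;j,k}$$
$$= \tilde{\mathcal{F}}(A') - \tilde{\mathcal{F}}(A) + \sum_{(i,j,k) \in \Omega^{cross}_{A',B'}} \Delta_{i;j,k} - \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \Delta_{i;j,k}$$
$$\ge \psi(n, k) \cdot \epsilon + \sum_{(i,j,k) \in \Omega^{cross}_{A',B'}} \Delta_{i;j,k} - \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \Delta_{i;j,k}$$
(by Lemma 42 and $\tilde{\mathcal{F}}(A) = 0$)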
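Now assume that all of the estimation errors $\Delta$ are uniformly bounded as:

(D.5)    $|\Delta_{i;j,k}| \le \frac{\epsilon}{4} \left(\frac{\psi(n, k)}{n^2 k - k^2 n}\right).$

And note that $|\Omega^{cross}_{A,B}| = |\Omega^{cross}_{A',B'}| = k^2(n - k) + k(n - k)^2 = n^2 k - k^2 n$. We
have:

$$\left|\sum_{(i,j,k) \in \Omega^{cross}_{A',B'}} \Delta_{i;j,k} - \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \Delta_{i;j,k}\right| \le \frac{\epsilon\, \psi(n, k)}{2} < \epsilon \cdot \psi(n, k).$$

Combining this bound on the estimation errors with the bound on $\tilde{\mathcal{F}}(A') -
\tilde{\mathcal{F}}(A)$ yields:

$$\hat{\mathcal{F}}(A') - \hat{\mathcal{F}}(A) \ge \epsilon\, \psi(n, k) - \left|\sum_{(i,j,k) \in \Omega^{cross}_{A',B'}} \Delta_{i;j,k} - \sum_{(i,j,k) \in \Omega^{cross}_{A,B}} \Delta_{i;j,k}\right| \ge \frac{\epsilon\, \psi(n, k)}{2} > 0,$$

which is almost what we want to show. How many samples do we require to
achieve the bound assumed in Equation D.5 with high probability? Observe
that the bound simplifies as,

$$\frac{\epsilon}{4} \left(\frac{\psi(n, k)}{n^2 k - k^2 n}\right) = \frac{\epsilon}{4} \left(\frac{(n - k)(n - 2k)}{nk(n - k)}\right) = \frac{\epsilon}{4} \left(\frac{n - 2k}{nk}\right),$$

which behaves like $O(\epsilon)$ when k is $O(1)$, but like $O(\epsilon/n)$ when k is $O(n)$.
Applying the sample complexity result of Lemma 41 with $\Delta = O(\epsilon/n)$, we
see that given $O\left(\frac{n^4}{\epsilon^2} \log^2 \frac{n^2}{\epsilon} \log \frac{n^4}{\gamma}\right)$ i.i.d. samples, the bound in Equation D.5
holds with probability $1 - \gamma$, concluding the proof. □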
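
The two algebraic simplifications used at the end of this proof are easy to confirm with exact arithmetic. The sketch below is a small check written for this purpose, with hypothetical function names; it evaluates the number of crossing triplets and the error tolerance on the right-hand side of Equation D.5 as exact fractions.

```python
from fractions import Fraction


def num_cross_triplets(n, k):
    """k^2 (n - k) + k (n - k)^2, the number of crossing triplets."""
    return k ** 2 * (n - k) + k * (n - k) ** 2


def error_tolerance(n, k, eps):
    """Right-hand side of Equation D.5: (eps / 4) * psi(n, k) / (n^2 k - k^2 n)."""
    psi = (n - k) * (n - 2 * k)
    return Fraction(eps) / 4 * Fraction(psi, num_cross_triplets(n, k))


# With eps = 1 the tolerance should equal (n - 2k) / (4 n k).
simplifies = all(
    num_cross_triplets(n, k) == n ** 2 * k - k ** 2 * n
    and error_tolerance(n, k, 1) == Fraction(n - 2 * k, 4 * n * k)
    for n in range(2, 40) for k in range(1, n // 2 + 1)
)
```

For n = 100 and eps = 1, the tolerance is 49/200 at k = 1 and 1/200 at k = 25, which illustrates the drop from $O(\epsilon)$ toward $O(\epsilon/n)$ as k grows with n.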

REFERENCES 

Arrow, K. (1963). Social Choice and Individual Values. Yale University Press. 

Bach, F. R. and Jordan, M. I. (2001). Thin Junction Trees. In Advances in Neural
Information Processing Systems 14 569-576. MIT Press.

Bayer, D. and Diaconis, P. (1992). Trailing the Dovetail Shuffle to its Lair. The Annals
of Probability.

Chechetka, A. and Guestrin, C. (2007). Efficient Principled Learning of Thin Junction
Trees. In NIPS.

Chen, H., Branavan, S. R. K., Barzilay, R. and Karger, D. R. (2009). Global 
models of document structure using latent permutations. In NAACL 2009 371-379. 
Association for Computational Linguistics. 






Clausen, M. and Baum, U. (1993). Fast Fourier Transforms for Symmetric Groups:
Theory and Implementation. Mathematics of Computation 61 833-847.

Diaconis, P. (1988). Group Representations in Probability and Statistics. IMS Lecture
Notes.

Diaconis, P. (1989). A Generalization of Spectral Analysis with Application to Ranked
Data. The Annals of Statistics 17 949-979.

Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and
Hall, London.

Farias, V., Jagabathula, S. and Shah, D. (2009). A Data-Driven Approach to Mod- 
eling Choice. In Advances in Neural Information Processing Systems 22 (Y. Bengio, 
D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.) 504-512. 

Fligner, M. and Verducci, J. (1986). Distance-based Ranking models. Journal of the 
Royal Statistical Society, Series B 83 859-869. 

Fligner, M. and Verducci, J. (1988). Multistage Ranking Models. Journal of the American
Statistical Association 83.

Fulman, J. (1998). The combinatorics of biased riffle shuffles. Combinatorica 18 173-184.

Gallo, G., Longo, G., Pallottino, S. and Nguyen, S. (1993). Directed hypergraphs 
and applications. Discrete Appl. Math. 42. 

Gormley, C. and Murphy, B. (2006). A Latent Space Model for Rank Data. In ICML.

Guiver, J. and Snelson, E. (2009). Bayesian inference for Plackett-Luce ranking models.
In ICML.

Helmbold, D. P. and Warmuth, M. K. (2007). Learning Permutations with Exponential
Weights. In COLT.

Hoffgen, K. U. (1993). Learning and Robust Learning of Product Distributions. In
COLT.

Holmes, S. (1999). Phylogenies: an overview. IMA series, Statistics and Genetics 112 
81-119. 

Holmes, S. (2003). Bootstrapping phylogenetic trees: theory and methods. Statistical 
Science 18 241-255. 

Huang, J., Guestrin, C. and Guibas, L. (2007). Efficient Inference for Distributions 
on Permutations. In NIPS. 

Huang, J. and Guestrin, C. (2009a). Riffled Independence for Ranked Data. In NIPS.

Huang, J., Guestrin, C. and Guibas, L. (2009b). Fourier theoretic probabilistic inference
over permutations. JMLR 10.

Huang, J. and Guestrin, C. (2010). Learning Hierarchical Riffle Independent Groupings
from Rankings. In ICML.

Huang, J., Guestrin, C., Jiang, X. and Guibas, L. (2009). Exploiting Probabilistic
Independence for Permutations. In AISTATS.

Jagabathula, S. and Shah, D. (2008). Inferring rankings under constrained sensing. In 
NIPS. 

Kamishima, T. (2003). Nantonac collaborative filtering: recommendation based on order
responses. In KDD 583-588.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press.

Kondor, R. (2008). Group Theoretical Methods in Machine Learning. PhD thesis,
Columbia University.

Kondor, R. and Borgwardt, K. M. (2008). The skew spectrum of graphs. In ICML
496-503.

Kondor, R., Howard, A. and Jebara, T. (2007). Multi-Object Tracking with Representations
of the Symmetric Group. In AISTATS.






Lebanon, G. and Mao, Y. (2008). Non-parametric Modeling of Partially Ranked Data. 
In NIPS. 

M. Sun, K. C.-T. G. Lebanon (2010). Visualizing Differences in Web Search Algo- 
rithms using the Expected Weighted Hoeffding Distance. In Proceedings of the 19th 
International World Wide Web Conference (WWW). 

Mallows, C. (1957). Non-null ranking models. Biometrika 44 114-130. 

Marden, J. I. (1995). Analyzing and Modeling Rank Data. Chapman & Hall. 

Maslen, D. (1998). The efficient computation of Fourier transforms on the Symmetric 
group. Mathematics of Computation 67 1121-1147. 

Meila, M., Phadnis, K., Patterson, A. and Bilmes, J. (2007). Consensus ranking
under the exponential model. Technical Report No. 515.

Motwani, R. and Raghavan, P. (1996). Randomized algorithms. ACM Comput. Surv.
28.

Petterson, J., Caetano, T., McAuley, J. and Yu, J. (2009). Exponential Family
Graph Matching and Ranking. CoRR abs/0904.2623.

Plackett, R. (1975). The analysis of permutations. Applied Statistics 24 193-202.

Reid, D. (1979). An algorithm for tracking multiple targets. IEEE Trans. on Automatic
Control 6 843-854.

Rockmore, D. N. (2000). The FFT: An Algorithm the Whole Family Can Use. Computing
in Science and Engineering 02 60-64.

Sagan, B. (2001). The Symmetric Group. Springer.

Shahaf, D., Chechetka, A. and Guestrin, C. (2009). Learning Thin Junction Trees
via Graph Cuts. In Artificial Intelligence and Statistics (AISTATS).

Shi, J. and Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE PAMI 
22. 

Shin, J., Lee, N., Thrun, S. and Guibas, L. (2005). Lazy inference on object identities 
in wireless sensor networks. In IPSN. 

Terras, A. (1999). Fourier Analysis on Finite Groups and Applications. London Math- 
ematical Society. 

Thurstone, L. (1927). A law of comparative judgement. Psychological Review 34 273- 
286. 

200 Smith Hall 
Carnegie Mellon University, 
5000 Forbes Avenue, 
Pittsburgh Pennsylvania, 15213. 
E-mail: jchl@cs.cmu.edu 

guestrin@cs.cmu.edu 



